
    Usage Guidelines

    This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

    This means:

    • ⓧ No downloading this notebook.
    • ⓧ No re-sharing of this notebook with friends or colleagues.
    • ⓧ No downloading the embedded videos in this notebook.
    • ⓧ No re-sharing embedded videos with friends or colleagues.
    • ⓧ No adding this notebook to public or private repositories.
    • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

    Databases: PyMongo

    Working with PyMongo¶

    For all of these examples, we're going to be working with the "lagos" collection in the "air-quality" database. Before we can do anything else, we need to bring in pandas (which we won't use until the very end), pprint (a module that lets us see the data in an understandable way), and PyMongo (a library for working with MongoDB databases).

[1]:
import datetime

import pandas as pd
from pprint import PrettyPrinter
from pymongo import MongoClient

pp = PrettyPrinter(indent=2)

    Databases¶

Data comes to us in lots of different ways, and one of those ways is in a database. A database is an organized collection of data, stored so that it can be accessed, queried, and updated.

    Servers and Clients¶

    Next, we need to connect to a server. A database server is where the database resides; it can be accessed using a client. Without a client, a database is just a collection of information that we can't work with, because we have no way in. We're going to be learning more about a database called MongoDB, and we'll use PrettyPrinter to make the information it generates easier to understand. Here's how the connection works:

    [2]:
    client = MongoClient(host="localhost", port=27017)

    Semi-structured Data¶

    Databases are designed to work with either structured data or semi-structured data. In this part of the course, we're going to be working with databases that contain semi-structured data. Data is semi-structured when it has some kind of organizing logic, but that logic can't be displayed using rows and columns. Your email account contains semi-structured data if it’s divided into sections like Inbox, Sent, and Trash. If you’ve ever seen tweets from Twitter grouped by hashtag, that’s semi-structured data too. Semi-structured data is also used in sensor readings, which is what we'll be working with here.
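To make this concrete, here's a sketch of two records from the same hypothetical sensor feed, written as nested Python dictionaries. They share an organizing logic (a metadata block plus a reading) but carry different fields, so they wouldn't fit neatly into fixed rows and columns:

```python
# Two hypothetical semi-structured records: same organizing logic,
# different fields, no fixed schema.
temp_reading = {
    "metadata": {"site": 4, "sensor_type": "DHT11"},
    "temperature": 27.0,
}
pm_reading = {
    "metadata": {"site": 29, "sensor_type": "SDS011"},
    "P2": 14.42,
}

# Both are valid records even though their top-level keys differ.
print(set(temp_reading) == set(pm_reading))  # → False
```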

    Exploring a Database¶

So, now that we're connected to a server, let's take a look at what's there. Working our way down the specificity scale, the first thing we need to do is figure out which databases are on this server. To see them, we'll use the list_databases method, like this:

    [3]:
    pp.pprint(list(client.list_databases()))
    [ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
      {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
      {'empty': False, 'name': 'config', 'sizeOnDisk': 98304},
      {'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
      {'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
    

It looks like this server contains five databases: "admin", "air-quality", "config", "local", and "wqu-abtest". We're only interested in "air-quality", so let's connect to that one:

    [4]:
    db = client["air-quality"]

    In MongoDB, a database is a container for collections. Each database gets its own set of files, and a single MongoDB server typically has multiple databases.

    Collections¶

    Let's use a for loop to take a look at the collections in the "air-quality" database:

    [5]:
for c in db.list_collections():
    print(c["name"])
    system.views
    lagos
    system.buckets.lagos
    nairobi
    system.buckets.nairobi
    dar-es-salaam
    system.buckets.dar-es-salaam
    

As you can see, there are three actual collections here: "nairobi", "lagos", and "dar-es-salaam"; the system.* entries are internal collections that MongoDB manages itself. Since we're only interested in the "lagos" collection, let's get it on its own like this:

    [6]:
    lagos = db["lagos"]

    Documents¶

    A MongoDB document is an individual record of data in a collection, and is the basic unit of analysis in MongoDB. Documents come with metadata that helps us understand what the document is; we'll get back to that in a minute. In the meantime, let's use the count_documents method to see how many documents the "lagos" collection contains:

    [7]:
    lagos.count_documents({})
    [7]:
    166496

    Practice

    Try it yourself! Bring in all the necessary libraries and modules, then connect to the "air-quality" database and print the number of documents in the "nairobi" collection.

    [8]:
client = MongoClient(host="localhost", port=27017)
pp.pprint(list(client.list_databases()))
db = client["air-quality"]
for c in db.list_collections():
    print(c["name"])
nairobi = db["nairobi"]
nairobi.count_documents({})
    [    {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
         {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
         {'empty': False, 'name': 'config', 'sizeOnDisk': 98304},
         {'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
         {'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
    system.views
    lagos
    system.buckets.lagos
    nairobi
    system.buckets.nairobi
    dar-es-salaam
    system.buckets.dar-es-salaam
    
    [8]:
    202212

    Retrieving Data¶

    Now that we know how many documents the "lagos" collection contains, let's take a closer look at what's there. The first thing you'll notice is that the output starts out with a curly bracket ({), and ends with a curly bracket (}). That tells us that this information is a dictionary. To access documents in the collection, we'll use two methods: find and find_one. As you might expect, find will retrieve all the documents, and find_one will bring back only the first document. For now, let's stick to find_one; we'll come back to find later.

    Just like everywhere else, we'll need to assign a variable name to whatever comes back, so let's call this one result.

    [9]:
result = lagos.find_one({})
pp.pprint(result)
    {    '_id': ObjectId('6334b0f18c51459f9b1d955d'),
         'metadata': {    'lat': 6.501,
                          'lon': 3.367,
                          'measurement': 'temperature',
                          'sensor_id': 10,
                          'sensor_type': 'DHT11',
                          'site': 4},
         'temperature': nan,
         'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 88000)}
    

    Key-Value Pairs¶

    There's a lot going on here! Let's work from the bottom up, starting with this:

{
    'temperature': nan,
    'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 88000)
}
    

The actual data is labeled temperature and timestamp, and if seeing it presented this way seems familiar, that's because what you're seeing at the bottom are two key-value pairs. In PyMongo, "_id" is always the primary key — the field whose value uniquely identifies each document in a collection; we'll talk about that more in a minute.
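Since a document behaves like a Python dictionary, individual values can be pulled out by key. A minimal sketch, using a hand-built document with illustrative values standing in for the result:

```python
# A stand-in for a document returned by find_one (values illustrative).
doc = {
    "temperature": 27.0,
    "metadata": {"site": 4, "measurement": "temperature"},
}

print(doc["temperature"])        # top-level value
print(doc["metadata"]["site"])   # nested value inside the metadata block
```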

    Metadata¶

    Next, we have this:

'metadata': { 'lat': 6.501,
              'lon': 3.367,
              'measurement': 'temperature',
              'sensor_id': 10,
              'sensor_type': 'DHT11',
              'site': 4}
    

    This is the document's metadata. Metadata is data about the data. If you’re working with a database, its data is the information it contains, and its metadata describes what that information is. In MongoDB, each document often has metadata of its own. If we go back to the example of your email account, each message in your Sent folder includes both the message itself and information about when you sent it and who you sent it to; the message is data, and the other information is metadata.

The metadata we see in this block of code tells us what the key-value pairs from the last code block mean, and where the information stored there comes from. There's location data, a line telling us what the key-value pairs measure, some information about the equipment used to gather the data, and where the data came from.

    Identifiers¶

    Finally, at the top, we have this:

{
    '_id': ObjectId('6334b0f18c51459f9b1d955d')
}
    

    This is the document's unique identifier, which is similar to the index label for each row in a pandas DataFrame.
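One handy detail about these identifiers: the first four bytes of an ObjectId are a Unix timestamp recording when the id was generated, so a document's creation time can be recovered from its "_id" alone. A sketch using only the standard library (the hex string is the id from the find_one output above):

```python
from datetime import datetime, timezone

# The first 8 hex characters of an ObjectId encode seconds since the Unix epoch.
oid = "6334b0f18c51459f9b1d955d"
seconds = int(oid[:8], 16)
print(datetime.fromtimestamp(seconds, tz=timezone.utc))
```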

    Practice

    Try it yourself! Retrieve a single document from the "nairobi" collection, and print the result.

    [10]:
result = nairobi.find_one({})
pp.pprint(result)
    {    'P1': 39.67,
         '_id': ObjectId('6334b0e98c51459f9b198d27'),
         'metadata': {    'lat': -1.3,
                          'lon': 36.785,
                          'measurement': 'P1',
                          'sensor_id': 57,
                          'sensor_type': 'SDS011',
                          'site': 29},
         'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
    

    Analyzing Data¶

    Now that we've seen what a document looks like in this collection, let's start working with what we've got. Since our metadata includes information about each sensor's "site", we might be curious to know how many sites are in the "lagos" collection. To do that, we'll use the distinct method, like this:

    [11]:
    lagos.distinct("metadata.site")
    [11]:
    [3, 4]

Notice that in order to grab the "site" number, we needed to include the "metadata" prefix, using dot notation to reach the nested field.

    This tells us that there are 2 sensor sites in Lagos: one labeled 3 and the other labeled 4.

    Let's go further. We know that there are two sensor sites in Lagos, but we don't know how many documents are associated with each site. To find that out, we'll use the count_documents method for each site.

    [12]:
print("Documents from site 3:", lagos.count_documents({"metadata.site": 3}))
print("Documents from site 4:", lagos.count_documents({"metadata.site": 4}))
    Documents from site 3: 140586
    Documents from site 4: 25910
    

    Practice

    Try it yourself! Find out how many sensor sites are in Nairobi, what their labels are, and how many documents are associated with each one.

[13]:
print("Documents from site 29:", nairobi.count_documents({"metadata.site": 29}))
print("Documents from site 6:", nairobi.count_documents({"metadata.site": 6}))
Documents from site 29: 131852
Documents from site 6: 70360

    Now that we know how many documents are associated with each site, let's keep drilling down and find the number of readings for each site. We'll do this with the aggregate method.

Before we run it, let's take a look at what's happening in the code. First, you'll notice several dollar signs ($). A dollar sign at the start of a key marks an aggregation operator, while a dollar sign inside a string (like "$metadata.site") refers to the value of that field. Here, $group collects the documents into one group per site, and $count tallies how many documents land in each group.

There's also a new field: "_id". In a $group stage, "_id" holds the grouping key — here, the site number that all the documents in a group share.

    Let's run the code and see what happens:

[15]:
result = lagos.aggregate(
    [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
)
pp.pprint(list(result))
    [{'_id': 4, 'count': 25910}, {'_id': 3, 'count': 140586}]
    

    With that information in mind, we might want to know what those readings actually are. Since we're really interested in measures of air quality, let's take a look at the P2 values in the "lagos" collection. P2 measures the amount of particulate matter in the air, which in this case is something called PM 2.5. If we wanted to get all the documents in a collection, we could, but that would result in an unmanageably large number of records clogging up the memory on our machines. Instead, let's use the find method and use limit to make sure we only get back the first 3.

    [16]:
result = lagos.find({"metadata.measurement": "P2"}).limit(3)
pp.pprint(list(result))
    [    {    'P2': 14.42,
              '_id': ObjectId('6334b0f28c51459f9b1de145'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 39000)},
         {    'P2': 19.66,
              '_id': ObjectId('6334b0f28c51459f9b1de146'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 11, 23, 870000)},
         {    'P2': 24.79,
              '_id': ObjectId('6334b0f28c51459f9b1de147'),
              'metadata': {    'lat': 6.501,
                               'lon': 3.367,
                               'measurement': 'P2',
                               'sensor_id': 6,
                               'sensor_type': 'PPD42NS',
                               'site': 4},
              'timestamp': datetime.datetime(2018, 1, 7, 7, 21, 53, 981000)}]
    

    Practice

    Try it yourself! Find out how many sensor sites are in Nairobi, what their labels are, how many documents are associated with each one, and the number of observations from each site. Then, return the first three documents with the value P2.

[29]:
result = nairobi.aggregate(
    [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
)
pp.pprint(list(result))
result = nairobi.find({"metadata.measurement": "P2"}).limit(3)
pp.pprint(list(result))
    [{'_id': 29, 'count': 131852}, {'_id': 6, 'count': 70360}]
    

    So far, we've been dealing with relatively small subsets of the data in our collections, but what if we need to work with something bigger? Let's start by using distinct to remind ourselves of the kinds of data we have at our disposal.

    [18]:
    lagos.distinct("metadata.measurement")
    [18]:
    ['temperature', 'P1', 'humidity', 'P2']

There are also comparison query operators that can be helpful for filtering the data:

• $gt: greater than (>)
• $lt: less than (<)
• $gte: greater than or equal to (>=)
• $lte: less than or equal to (<=)
• $eq: equal to (==)
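These operators can also be combined in a single filter to express a range. A sketch of such a filter document, built as a plain dict — constructing it needs no server, and the result could then be passed to a collection's find method:

```python
import datetime

# A half-open range filter: readings from September 2018.
start = datetime.datetime(2018, 9, 1)
end = datetime.datetime(2018, 10, 1)
query = {"timestamp": {"$gte": start, "$lt": end}}

# The same dict could be passed as nairobi.find(query).
print(query)
```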

    Let's use the timestamp to see how we can use these operators to select different documents:

    [19]:
result = nairobi.find({"timestamp": {"$gt": datetime.datetime(2018, 9, 1)}}).limit(3)
pp.pprint(list(result))
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P1': 39.13,
              '_id': ObjectId('6334b0e98c51459f9b198d28'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
         {    'P1': 30.07,
              '_id': ObjectId('6334b0e98c51459f9b198d29'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
    
    [20]:
result = nairobi.find({"timestamp": {"$lt": datetime.datetime(2018, 12, 1)}}).limit(3)
pp.pprint(list(result))
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P1': 39.13,
              '_id': ObjectId('6334b0e98c51459f9b198d28'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
         {    'P1': 30.07,
              '_id': ObjectId('6334b0e98c51459f9b198d29'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
    
    [21]:
result = nairobi.find(
    {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
)
pp.pprint(list(result))
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P2': 34.43,
              '_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P2',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
    

    Practice

Try it yourself! Find three documents with timestamp greater than or equal to, and three with timestamp less than or equal to, the date December 1, 2018 — datetime.datetime(2018, 12, 1, 0, 0, 6, 767000).

    [33]:
result = nairobi.find(
    {"timestamp": {"$gte": datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)}}
).limit(3)
pp.pprint(list(result))
    [    {    'P1': 17.08,
              '_id': ObjectId('6334b0e98c51459f9b19eba8'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)},
         {    'P1': 17.62,
              '_id': ObjectId('6334b0e98c51459f9b19eba9'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 12, 1, 0, 5, 6, 327000)},
         {    'P1': 11.05,
              '_id': ObjectId('6334b0e98c51459f9b19ebaa'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 12, 1, 0, 10, 5, 579000)}]
    
[ ]:
# Less than or equal to
result = nairobi.find(
    {"timestamp": {"$lte": datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)}}
).limit(3)
pp.pprint(list(result))

    Updating Documents¶

We can also update documents by passing a filter and new values, using update_one to update a single record or update_many to update many records. Let's look at an example. Before updating, the record looks like this:
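The call shape is the same for both methods: a filter document first, then an update document built from an operator like $set. As a sketch, here are the two arguments as plain dicts (the sensor-type values mirror the ones used in this section); the actual call would be nairobi.update_many(filter_doc, update_doc):

```python
# Filter: which documents to touch.
filter_doc = {"metadata.sensor_type": {"$eq": "SDS011"}}
# Update: what to change — $set overwrites the named field.
update_doc = {"$set": {"metadata.sensor_type": "SDS101"}}

print(filter_doc)
print(update_doc)
```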

    [34]:
result = nairobi.find(
    {
        "timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
        "metadata.measurement": "P1",
    }
)
pp.pprint(list(result))
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
    

Now let's update the sensor type from "SDS011" to "SDS101". We first select all records whose sensor type equals "SDS011", then set the new value to "SDS101":

    [35]:
result = nairobi.update_many(
    {"metadata.sensor_type": {"$eq": "SDS011"}},
    {"$set": {"metadata.sensor_type": "SDS101"}},
)

Querying the same timestamp again, we can inspect the records:

    [36]:
result = nairobi.find(
    {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
)
pp.pprint(list(result))
    [    {    'P1': 39.67,
              '_id': ObjectId('6334b0e98c51459f9b198d27'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P1',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
         {    'P2': 34.43,
              '_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
              'metadata': {    'lat': -1.3,
                               'lon': 36.785,
                               'measurement': 'P2',
                               'sensor_id': 57,
                               'sensor_type': 'SDS011',
                               'site': 29},
              'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
    

    We can change it back:

    [37]:
result = nairobi.update_many(
    {"metadata.sensor_type": {"$eq": "SDS101"}},
    {"$set": {"metadata.sensor_type": "SDS011"}},
)
    [38]:
    result.raw_result
    [38]:
    {'n': 0, 'nModified': 0, 'ok': 1.0, 'updatedExisting': False}

    Aggregation¶

To build a model, we'll want the dimension with the largest number of measurements, so let's aggregate the data in each document. Since we already know that site 3 has significantly more documents than site 4, let's start with site 3. We can use the $match stage to select only the site 3 data. The code looks like this:

    [39]:
result = lagos.aggregate(
    [
        {"$match": {"metadata.site": 3}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
    ]
)
pp.pprint(list(result))
    [    {'_id': 'temperature', 'count': 35147},
         {'_id': 'P1', 'count': 35146},
         {'_id': 'P2', 'count': 35146},
         {'_id': 'humidity', 'count': 35147}]
    

    Practice

    Try it yourself! Find the number of each measurement type at site 29 in Nairobi.

    [ ]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
    ]
)
pp.pprint(list(result))

After aggregation, there is another useful operator called $project, which allows you to specify which fields to display by adding new fields or deleting fields. Using the Nairobi data from site 29, we can first count each sensor type:

    [42]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},
    ]
)
pp.pprint(list(result))
    [{'_id': 'DHT22', 'count': 66038}, {'_id': 'SDS011', 'count': 65814}]
    

We can see there are two sensor types and their corresponding counts. If we only want to display the types but don't care about the counts, we can suppress the count field by setting it to 0 in $project:

    [43]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},
        {"$project": {"count": 0}},
    ]
)
pp.pprint(list(result))
    [{'_id': 'DHT22'}, {'_id': 'SDS011'}]
    

    The $project syntax is also useful for deleting the intermediate fields that we used to generate our final fields but no longer need. In the following example, let's calculate the date difference for each sensor type. We'll first use the aggregation method to get the start date and last date.

    [44]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {
            "$group": {
                "_id": "$metadata.sensor_type",
                "date_max": {"$max": "$timestamp"},
                "date_min": {"$min": "$timestamp"},
            }
        },
    ]
)
pp.pprint(list(result))
    [    {    '_id': 'DHT22',
              'date_max': datetime.datetime(2018, 12, 31, 23, 57, 37, 257000),
              'date_min': datetime.datetime(2018, 9, 1, 0, 0, 4, 301000)},
         {    '_id': 'SDS011',
              'date_max': datetime.datetime(2018, 12, 31, 23, 55, 4, 811000),
              'date_min': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
    

Then we can calculate the difference using $dateDiff, which takes a start date, an end date, and a unit. We can see from the results above that the min and max dates for the two sensor types are very close to each other; the only differences are in the minutes, so we'll specify the unit as minute. And since we no longer need the start and end dates themselves, we can define a "dateDiff" field inside $project so that only it appears in the final display:

    [45]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {
            "$group": {
                "_id": "$metadata.sensor_type",
                "date_max": {"$max": "$timestamp"},
                "date_min": {"$min": "$timestamp"},
            }
        },
        {
            "$project": {
                "dateDiff": {
                    "$dateDiff": {
                        "startDate": "$date_min",
                        "endDate": "$date_max",
                        "unit": "minute",
                    }
                }
            }
        },
    ]
)
pp.pprint(list(result))
    [{'_id': 'DHT22', 'dateDiff': 175677}, {'_id': 'SDS011', 'dateDiff': 175675}]
    

If we specify the unit as day, it will show the difference in days:

    [ ]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {
            "$group": {
                "_id": "$metadata.sensor_type",
                "date_max": {"$max": "$timestamp"},
                "date_min": {"$min": "$timestamp"},
            }
        },
        {
            "$project": {
                "dateDiff": {
                    "$dateDiff": {
                        "startDate": "$date_min",
                        "endDate": "$date_max",
                        "unit": "day",
                    }
                }
            }
        },
    ]
)
pp.pprint(list(result))

    Practice

Try it yourself! Find the date difference for each measurement type at site 29 in Nairobi.

    [ ]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {
            "$group": {
                "_id": "$metadata.measurement",
                "date_max": {"$max": "$timestamp"},
                "date_min": {"$min": "$timestamp"},
            }
        },
        {
            "$project": {
                "dateDiff": {
                    "$dateDiff": {
                        "startDate": "$date_min",
                        "endDate": "$date_max",
                        "unit": "minute",
                    }
                }
            }
        },
    ]
)
pp.pprint(list(result))

We can do more with the date data using $dateTrunc, which truncates datetime values. We need to specify the date to truncate (a Date, a Timestamp, or an ObjectId), the unit (year, month, day, hour, minute, second), and optionally binSize (a number defining how many units each bin spans). Let's check the example below, where we group data by month using $dateTrunc and then count how many observations there are for each month.

    [ ]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {
            "$group": {
                "_id": {
                    "$dateTrunc": {
                        "date": "$timestamp",
                        "unit": "month",
                    }
                },
                "count": {"$count": {}},
            }
        },
    ]
)
pp.pprint(list(result))

    Practice

    Try it yourself! Truncate date by week and count at site 29 in Nairobi.

    [ ]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {
            "$group": {
                "_id": {"$dateTrunc": {"date": "$timestamp", "unit": "week"}},
                "count": {"$count": {}},
            }
        },
    ]
)
pp.pprint(list(result))

    Finishing Up¶

    So far, we've connected to a server, accessed that server with a client, found the collection we were looking for within a database, and explored that collection in all sorts of different ways. Now it's time to get the data we'll actually need to build a model, and store that in a way we'll be able to use.

Let's use find to retrieve the PM 2.5 data from site 3. And, since we don't need any of the metadata to build our model, let's strip that out using the projection argument. In this case, we're telling the collection that we only want to see "timestamp" and "P2".

    [40]:
result = lagos.find(
    {"metadata.site": 3, "metadata.measurement": "P2"},
    # `projection` limits the fields we'll get back.
    projection={"P2": 1, "timestamp": 1, "_id": 0},
)
pp.pprint(result.next())
    {'P2': 0.01, 'timestamp': datetime.datetime(2018, 10, 1, 0, 0, 52, 906000)}
    

    Finally, we'll use pandas to read the extracted data into a DataFrame, making sure to set timestamp as the index:

    [41]:
df = pd.DataFrame(result).set_index("timestamp")
df.head()
    [41]:
                           P2
timestamp
2018-10-01 00:04:20.554  0.01
2018-10-01 00:07:47.504  0.01
2018-10-01 00:11:14.382  0.01
2018-10-01 00:14:41.239  0.01
2018-10-01 00:18:05.938  0.01

    Practice

    Try it yourself! Retrieve the PM 2.5 data from site 29 in Nairobi and strip out the metadata to create a DataFrame that shows only timestamp and P2. Print the result.

    [ ]:
    result = ...

    References & Further Reading¶

    • Further reading about servers and clients
    • Definitions from the MongoDB documentation
• Information on iterators
• MongoDB documentation on aggregation

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.



    3.1. Wrangling Data with MongoDB 🇰🇪

[1]:
from IPython.display import VimeoVideo
[2]:
VimeoVideo("665412094", h="8334dfab2e", width=600)
[3]:
VimeoVideo("665412135", h="dcff7ab83a", width=600)

    Task 3.1.1: Instantiate a PrettyPrinter, and assign it to the variable pp.

    • Construct a PrettyPrinter instance in pprint.
[4]:
from pprint import PrettyPrinter

pp = PrettyPrinter(indent=2)

    Prepare Data¶

    Connect¶

[5]:
VimeoVideo("665412155", h="1ca0dd03d0", width=600)

    Task 3.1.2: Create a client that connects to the database running at localhost on port 27017.

    • What's a database client?
    • What's a database server?
    • Create a client object for a MongoDB instance.
    [6]:
from pymongo import MongoClient

client = MongoClient(host="localhost", port=27017)

    Explore¶

    [7]:
    VimeoVideo("665412176", h="6fea7c6346", width=600)
    [7]:

    Task 3.1.3: Print a list of the databases available on client.

    • What's an iterator?
    • List the databases of a server using PyMongo.
    • Print output using pprint.
    [8]:
    pp.pprint(list(client.list_databases()))
    [ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
      {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
      {'empty': False, 'name': 'config', 'sizeOnDisk': 110592},
      {'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
      {'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
    
    [9]:
    VimeoVideo("665412216", h="7d4027dc33", width=600)
    [9]:

    Task 3.1.4: Assign the "air-quality" database to the variable db.

    • What's a MongoDB database?
    • Access a database using PyMongo.
    [10]:
    db = client["air-quality"]
    [11]:
    VimeoVideo("665412231", h="89c546b00f", width=600)
    [11]:

    Task 3.1.5: Use the list_collections method to print a list of the collections available in db.

    • What's a MongoDB collection?
    • List the collections in a database using PyMongo.
    [12]:
    for c in db.list_collections():
        print(c["name"])
    system.views
    lagos
    system.buckets.lagos
    nairobi
    system.buckets.nairobi
    dar-es-salaam
    system.buckets.dar-es-salaam
    
    [13]:
    VimeoVideo("665412252", h="bff2abbdc0", width=600)
    [13]:

    Task 3.1.6: Assign the "nairobi" collection in db to the variable name nairobi.

    • Access a collection in a database using PyMongo.
    [14]:
    nairobi = db["nairobi"]
    [15]:
    VimeoVideo("665412270", h="e4a5f5c84b", width=600)
    [15]:

    Task 3.1.7: Use the count_documents method to see how many documents are in the nairobi collection.

    • What's a MongoDB document?
    • Count the documents in a collection using PyMongo.
    [16]:
    nairobi.count_documents({})
    [16]:
    202212
    [17]:
    VimeoVideo("665412279", h="c2315f3be1", width=600)
    [17]:

    Task 3.1.8: Use the find_one method to retrieve one document from the nairobi collection, and assign it to the variable name result.

    • What's metadata?
    • What's semi-structured data?
    • Retrieve a document from a collection using PyMongo.
    [18]:
    result = nairobi.find_one({})
    pp.pprint(result)
    { 'P1': 39.67,
      '_id': ObjectId('6334b0e98c51459f9b198d27'),
      'metadata': { 'lat': -1.3,
                    'lon': 36.785,
                    'measurement': 'P1',
                    'sensor_id': 57,
                    'sensor_type': 'SDS011',
                    'site': 29},
      'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
    
    [19]:
    VimeoVideo("665412306", h="e1e913dfd1", width=600)
    [19]:

    Task 3.1.9: Use the distinct method to determine how many sensor sites are included in the nairobi collection.

    • Get a list of distinct values for a key among all documents using PyMongo.
    [20]:
    nairobi.distinct("metadata.site")
    [20]:
    [6, 29]
    [21]:
    VimeoVideo("665412322", h="4776c6d548", width=600)
    [21]:

    Task 3.1.10: Use the count_documents method to determine how many readings there are for each site in the nairobi collection.

    • Count the documents in a collection using PyMongo.
    [22]:
    print("Documents from site 6:", nairobi.count_documents({"metadata.site": 6}))
    print("Documents from site 29:", nairobi.count_documents({"metadata.site": 29}))
    Documents from site 6: 70360
    Documents from site 29: 131852
    
    [23]:
    VimeoVideo("665412344", h="d2354584cd", width=600)
    [23]:

    Task 3.1.11: Use the aggregate method to determine how many readings there are for each site in the nairobi collection.

    • Perform aggregation calculations on documents using PyMongo.
    [24]:
    result = nairobi.aggregate(
        [
            {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}},
        ]
    )
    pp.pprint(list(result))
    [{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]
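Under the hood, the $group stage behaves like a grouped count. Here is a rough pure-Python sketch of the same computation over a few invented documents (not the real nairobi data):

```python
from collections import Counter

# Toy documents shaped like the sensor readings (contents invented)
docs = [
    {"metadata": {"site": 6}},
    {"metadata": {"site": 29}},
    {"metadata": {"site": 29}},
]

# Group by site and count, analogous to the "$group" stage above
counts = Counter(d["metadata"]["site"] for d in docs)
print(dict(counts))  # {6: 1, 29: 2}
```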
    
    [25]:
    VimeoVideo("665412372", h="565122c9cc", width=600)
    [25]:

    Task 3.1.12: Use the distinct method to determine how many types of measurements have been taken in the nairobi collection.

    • Get a list of distinct values for a key among all documents using PyMongo.
    [26]:
    nairobi.distinct("metadata.measurement")
    [26]:
    ['P2', 'humidity', 'P1', 'temperature']
    [27]:
    VimeoVideo("665412380", h="f7f7a39bb3", width=600)
    [27]:

    Task 3.1.13: Use the find method to retrieve the PM 2.5 readings from all sites. Be sure to limit your results to 3 records only.

    • Query a collection using PyMongo.
    [28]:
    result = nairobi.find({"metadata.measurement": "P2"}).limit(3)
    pp.pprint(list(result))
    [ { 'P2': 34.43,
        '_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
        'metadata': { 'lat': -1.3,
                      'lon': 36.785,
                      'measurement': 'P2',
                      'sensor_id': 57,
                      'sensor_type': 'SDS011',
                      'site': 29},
        'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
      { 'P2': 30.53,
        '_id': ObjectId('6334b0ea8c51459f9b1a0db3'),
        'metadata': { 'lat': -1.3,
                      'lon': 36.785,
                      'measurement': 'P2',
                      'sensor_id': 57,
                      'sensor_type': 'SDS011',
                      'site': 29},
        'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
      { 'P2': 22.8,
        '_id': ObjectId('6334b0ea8c51459f9b1a0db4'),
        'metadata': { 'lat': -1.3,
                      'lon': 36.785,
                      'measurement': 'P2',
                      'sensor_id': 57,
                      'sensor_type': 'SDS011',
                      'site': 29},
        'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
    
    [29]:
    VimeoVideo("665412389", h="8976ea3090", width=600)
    [29]:

    Task 3.1.14: Use the aggregate method to calculate how many readings there are for each type ("humidity", "temperature", "P2", and "P1") in site 6.

    • Perform aggregation calculations on documents using PyMongo.
    [30]:
    result = nairobi.aggregate(
        [
            {"$match": {"metadata.site": 6}},
            {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
        ]
    )
    pp.pprint(list(result))
    [ {'_id': 'P2', 'count': 18169},
      {'_id': 'humidity', 'count': 17011},
      {'_id': 'P1', 'count': 18169},
      {'_id': 'temperature', 'count': 17011}]
    
    [31]:
    VimeoVideo("665412418", h="0c4b125254", width=600)
    [31]:

    Task 3.1.15: Use the aggregate method to calculate how many readings there are for each type ("humidity", "temperature", "P2", and "P1") in site 29.

    • Perform aggregation calculations on documents using PyMongo.
    [32]:
    result = nairobi.aggregate(
        [
            {"$match": {"metadata.site": 29}},
            {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
        ]
    )
    pp.pprint(list(result))
    [ {'_id': 'P2', 'count': 32907},
      {'_id': 'humidity', 'count': 33019},
      {'_id': 'P1', 'count': 32907},
      {'_id': 'temperature', 'count': 33019}]
    

    Import¶

    [33]:
    VimeoVideo("665412437", h="7a436c7e7e", width=600)
    [33]:

    Task 3.1.16: Use the find method to retrieve the PM 2.5 readings from site 29. Be sure to limit your results to 3 records only. Since we won't need the metadata for our model, use the projection argument to limit the results to the "P2" and "timestamp" keys only.

    • Query a collection using PyMongo.
    [34]:
    result = nairobi.find(
        {"metadata.site": 29, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )
    [35]:
    VimeoVideo("665412442", h="494636d1ea", width=600)
    [35]:

    Task 3.1.17: Read records from your result into the DataFrame df. Be sure to set the index to "timestamp".

    • Create a DataFrame from a dictionary using pandas.
    [36]:
    df = pd.DataFrame(result).set_index("timestamp")
    df.head()
    [36]:
    P2
    timestamp
    2018-09-01 00:00:02.472 34.43
    2018-09-01 00:05:03.941 30.53
    2018-09-01 00:10:04.374 22.80
    2018-09-01 00:15:04.245 13.30
    2018-09-01 00:20:04.869 16.57
    [37]:
    assert isinstance(df.index, pd.DatetimeIndex), "`df` should have a `DatetimeIndex`."


    Databases: SQL

    [ ]:
    from IPython.display import YouTubeVideo

    Working with SQL Databases¶

    A database is a collection of interrelated data. The primary goal of a database is to store and retrieve information in a convenient and efficient way. There are many types of databases. In this section, we will be dealing with a relational database. A relational database is a widely used database model that consists of a collection of uniquely named tables used to store information. The structure of a database model with its tables, constraints, and relationships is called a schema.

    Structured Query Language (SQL) is used to retrieve information from a relational database. SQL is one of the most commonly used database languages. It allows data stored in a relational database to be queried, modified, and manipulated with basic commands. SQL powers database engines like MySQL, SQL Server, SQLite, and PostgreSQL. The examples and projects in this course use SQLite.

    A table refers to a collection of rows and columns in a relational database. When reading data into a pandas DataFrame, an index can be defined, which acts as the label for every row in the DataFrame.
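Before connecting to a real database, it can help to see the whole round trip in miniature. The sketch below is a hedged, self-contained example using Python's built-in sqlite3 module with a made-up two-column table (not the schema of the course database):

```python
import sqlite3

# Build an in-memory SQLite database with a made-up table.
# Table name and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (household_id INTEGER, district_id INTEGER)")
conn.executemany(
    "INSERT INTO id_map VALUES (?, ?)",
    [(101, 1), (102, 2), (103, 2)],
)

# Retrieve only the households in district 2
rows = conn.execute(
    "SELECT household_id FROM id_map WHERE district_id = 2"
).fetchall()
print(rows)  # [(102,), (103,)]
conn.close()
```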

    Connecting to a Database¶

    ipython-sql¶

    Magic Commands¶

    Jupyter notebooks can run special commands that are not valid Python code but still affect the notebook's behavior. These are called magic commands. Magic commands cover a wide range of functionality. Some commonly used magic commands are below:

    Magic Command Description of Command
    %pwd Print the current working directory
    %cd Change the current working directory
    %ls List the contents of the current directory
    %history Show the history of the In [ ]: commands

    We will be leveraging magic commands to work with a SQLite database.

    ipython-sql¶

    ipython-sql allows you to write SQL code directly in a Jupyter Notebook. The %sql (or %%sql) magic command is added to the beginning of a code block and then SQL code can be written.

    Connecting with ipython-sql¶

    We can connect to a database using the %sql magic function:

    [ ]:
    %sql sqlite:////home/jovyan/nepal.sqlite

    sqlite3¶

    We can also connect to the same database using the sqlite3 package:

    [ ]:
    import sqlite3

    conn = sqlite3.connect("/home/jovyan/nepal.sqlite")

    Querying a Database¶

    Building Blocks of the Basic Query¶

    There are six common clauses used for querying data:

    Clause Name Definition
    SELECT Determines which columns to include in the query's result
    FROM Identifies the table from which to query the data
    WHERE Filters rows based on conditions
    GROUP BY Groups rows by common values in columns
    HAVING Filters out unwanted groups from GROUP BY
    ORDER BY Orders the rows using one or more columns
    LIMIT Outputs the specified number of rows

    All clauses may be used together, but SELECT and FROM are the only required clauses. The format of clauses is in the example query below:

    SELECT column1, column2
    FROM table_name
    WHERE "conditions"
    GROUP BY "column-list"
    HAVING "conditions"
    ORDER BY "column-list"
    

    SELECT and FROM¶

    You can use SELECT * to select all columns in a table. FROM specifies the table in the database to query. LIMIT 5 will select only the first five rows.

    Example

    [ ]:
    %%sql
    SELECT *
    FROM id_map
    LIMIT 5

    You can also use SELECT to retrieve specific columns from a table:

    [ ]:
    %%sql
    SELECT household_id,
           district_id
    FROM id_map
    LIMIT 5

    Practice

    Try it yourself! Use SELECT to select the district_id column from the id_map table.

    [ ]:
    %%sql

    We can also assign an alias, or temporary name, to a column using the AS command. Aliases can also be applied to tables. See the example below, which assigns the alias household_number to household_id:

    [ ]:
    %%sql
    SELECT household_id AS household_number,
           district_id
    FROM id_map
    LIMIT 5

    Practice

    Try it yourself! Use SELECT, FROM, AS, and LIMIT to select the first 5 rows from the id_map table. Rename the district_id column to district_number.

    [ ]:
    %%sql

    Filtering and Sorting Data¶

    SQL provides a variety of comparison operators that can be used with the WHERE clause to filter the data.

    Comparison Operator Description
    = Equal
    > Greater than
    < Less than
    >= Greater than or equal to
    <= Less than or equal to
    <> or != Not equal to
    LIKE String comparison test

    For example, to select the first 5 homes in Ramechhap (district 2):

    [ ]:
    %%sql
    SELECT *
    FROM id_map
    WHERE district_id = 2
    LIMIT 5

    Practice

    Try it yourself! Use WHERE to select the row with household_id equal to 13735001

    [ ]:
    %%sql

    Aggregating Data¶

    Aggregation functions take a collection of values as inputs and return one value as the output. The table below gives the frequently used built-in aggregation functions:

    Aggregation Function Definition
    MIN Return the minimum value
    MAX Return the largest value
    SUM Return the sum of values
    AVG Return the average of values
    COUNT Return the number of observations

    Use the COUNT function to find the number of observations in the id_map table that come from Ramechhap (district 2):

    [ ]:
    %%sql
    SELECT COUNT(*)
    FROM id_map
    WHERE district_id = 2

    Aggregation functions are frequently used with a GROUP BY clause to perform the aggregation on groups of data. For example, the query below returns the count of observations in each District:

    [ ]:
    %%sql
    SELECT district_id,
           COUNT(*) AS count
    FROM id_map
    GROUP BY district_id

    DISTINCT is a keyword to select unique records in a query result. For example, if we want to know the unique values in the district_id column:

    [ ]:
    %%sql
    SELECT DISTINCT(district_id)
    FROM id_map

    Practice

    Try it yourself! Use DISTINCT to count the number of unique values in the vdcmun_id column.

    [ ]:
    %%sql

    DISTINCT and COUNT can be used in combination to count the number of distinct records. For example, if we want to know the number of unique values in the district_id column:

    [ ]:
    %%sql
    SELECT COUNT(DISTINCT(district_id))
    FROM id_map

    Practice

    Try it yourself! Use DISTINCT and COUNT to count the number of unique values in the vdcmun_id column.

    [ ]:
    %%sql

    Joining Tables¶

    Joins link data from two or more tables together by using a column that is common between the two tables. The basic syntax for a join is below, where table1 and table2 refer to the two tables being joined, column1 and column2 refer to columns to be returned from both tables, and ID refers to the common column in the two tables.

    SELECT table1.column1,
           table2.column2
    FROM table1
    JOIN table2 ON table1.ID = table2.ID
    

    We'll explore the concept of joins by first identifying a single household that we'd like to pull in building information for. For example, let's say we want to see the corresponding foundation_type for the first home in Ramechhap (District 2). We'll start by looking at this single record in the id_map table.

    [ ]:
    %%sql
    SELECT *
    FROM id_map
    WHERE district_id = 2
    LIMIT 1

    This household has building_id equal to 23. Let's look at the foundation_type for this building, by filtering the building_structure table to find this building.

    [ ]:
    %%sql
    SELECT *
    FROM building_structure
    WHERE building_id = 23

    To join the two tables and limit the results to building_id = 23:

    [ ]:
    %%sql
    SELECT *
    FROM id_map
    JOIN building_structure ON id_map.building_id = building_structure.building_id
    WHERE id_map.building_id = 23

    In addition to the basic JOIN clause, specific join types can be used. These determine whether the matching ID needs to appear in one, both, or either of the two tables being joined. The different join types are below. The left table is the table named first, immediately after the FROM clause, and the right table is the table named after the JOIN clause. If the generic JOIN clause is used, it defaults to an INNER JOIN.

    JOIN Type Definition
    INNER JOIN Returns rows whose ID appears in both tables
    LEFT JOIN Returns all rows from the left table; right-table columns are NULL where the ID has no match
    RIGHT JOIN Returns all rows from the right table; left-table columns are NULL where the ID has no match
    FULL JOIN Returns rows whose ID appears in either table; columns are NULL wherever the ID has no match
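To make the join mechanics concrete, here is a hedged, self-contained sketch using two invented toy tables in an in-memory SQLite database (the table names echo the course database, but the schema and values here are made up):

```python
import sqlite3

# Two toy tables sharing a building_id key (values invented)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (household_id INTEGER, building_id INTEGER)")
conn.execute("CREATE TABLE building_structure (building_id INTEGER, roof_type TEXT)")
conn.executemany("INSERT INTO id_map VALUES (?, ?)", [(1, 23), (2, 24)])
conn.executemany(
    "INSERT INTO building_structure VALUES (?, ?)", [(23, 'Bamboo'), (25, 'RCC')]
)

# INNER JOIN keeps only building_ids present in both tables
inner = conn.execute(
    """SELECT id_map.household_id, building_structure.roof_type
       FROM id_map
       JOIN building_structure
         ON id_map.building_id = building_structure.building_id"""
).fetchall()
print(inner)  # [(1, 'Bamboo')]

# LEFT JOIN keeps every id_map row, filling NULL where no match exists
left = conn.execute(
    """SELECT id_map.household_id, building_structure.roof_type
       FROM id_map
       LEFT JOIN building_structure
         ON id_map.building_id = building_structure.building_id"""
).fetchall()
print(left)  # [(1, 'Bamboo'), (2, None)]
conn.close()
```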

    The video below outlines the main types of joins:

    [ ]:
    YouTubeVideo("2HVMiPPuPIM")

    Practice

    Try it yourself! Use the DISTINCT command to create a column with all unique building IDs in the id_map table. LEFT JOIN this column with the roof_type column from the building_structure table, showing only buildings where district_id is 1 and limiting your results to the first five rows of the new table.

    [ ]:
    %%sql

    Using pandas with SQL Databases¶

    To save the output of a query into a pandas DataFrame, we'll connect to the SQLite database using the sqlite3 package:

    [ ]:
    import sqlite3

    conn = sqlite3.connect("/home/jovyan/nepal.sqlite")

    To run a query using sqlite3, we need to store the query as a string. For example, the variable below called query is a string containing a query which returns the first 10 rows from the id_map table:

    [ ]:
    query = """
        SELECT *
        FROM id_map
        LIMIT 10
    """

    To save the results of the query into a pandas DataFrame, use the pd.read_sql() function. The optional parameter index_col can be used to set the index to a specific column from the query.

    [ ]:
    df = pd.read_sql(query, conn, index_col="building_id")
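If you want to experiment without the nepal.sqlite file, here is a hedged, self-contained version of the same pattern using an in-memory database and a made-up table:

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the real database (schema and values invented)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (building_id INTEGER, district_id INTEGER)")
conn.executemany("INSERT INTO id_map VALUES (?, ?)", [(23, 2), (24, 1)])

# Read the query result straight into a DataFrame, indexed by building_id
df = pd.read_sql("SELECT * FROM id_map", conn, index_col="building_id")
print(df.shape)  # (2, 1)
conn.close()
```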

    Practice

    Try it yourself! Use the pd.read_sql function to save the results of a query to a DataFrame. The query should select the first 20 rows from the id_map table.

    [ ]:
    query = ...

    References & Further Reading¶

    • Additional Explanation of Magic Commands
    • ipython-SQL User Documentation
    • Data Carpentry Course on SQL in Python
    • SQL Course Material on GitHub (1)
    • SQL Course Material on GitHub (2)


    3.2. Linear Regression with Time Series Data

    [1]:
    import matplotlib.pyplot as plt
    import pandas as pd
    import plotly.express as px
    import pytz
    from IPython.display import VimeoVideo
    from pymongo import MongoClient
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    [2]:
    VimeoVideo("665412117", h="c39a50bd58", width=600)
    [2]:

    Prepare Data¶

    Import¶

    [3]:
    VimeoVideo("665412469", h="135f32c7da", width=600)
    [3]:

    Task 3.2.1: Complete the code below to create a client that connects to the MongoDB server, assign the "air-quality" database to db, and assign the "nairobi" collection to nairobi.

    • Create a client object for a MongoDB instance.
    • Access a database using PyMongo.
    • Access a collection in a database using PyMongo.
    [4]:
    client = MongoClient(host="localhost", port=27017)
    db = client["air-quality"]
    nairobi = db["nairobi"]
    [5]:
    VimeoVideo("665412480", h="c20ed3e570", width=600)
    [5]:

    Task 3.2.2: Complete the wrangle function below so that the results from the database query are read into the DataFrame df. Be sure that the index of df is the "timestamp" from the results.

    • Create a DataFrame from a dictionary using pandas.
    [6]:
    def wrangle(collection):
        results = collection.find(
            {"metadata.site": 29, "metadata.measurement": "P2"},
            projection={"P2": 1, "timestamp": 1, "_id": 0},
        )
        # Read results into DataFrame, indexed by timestamp
        df = pd.DataFrame(results).set_index("timestamp")
        # Localize timezone
        df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
        # Remove outlier readings
        df = df[df["P2"] < 500]
        # Resample to 1-hour windows, forward-filling missing values
        df = df["P2"].resample("1H").mean().fillna(method="ffill").to_frame()
        # Add lag feature
        df["P2.L1"] = df["P2"].shift(1)
        # Drop rows with missing values
        df.dropna(inplace=True)
        return df
    [7]:
    VimeoVideo("665412496", h="d757475f7c", width=600)
    [7]:

    Task 3.2.3: Use your wrangle function to read the data from the nairobi collection into the DataFrame df.

    [8]:
    df = wrangle(nairobi)
    df.head(10)
    [8]:
    P2 P2.L1
    timestamp
    2018-09-01 04:00:00+03:00 15.800000 17.541667
    2018-09-01 05:00:00+03:00 11.420000 15.800000
    2018-09-01 06:00:00+03:00 11.614167 11.420000
    2018-09-01 07:00:00+03:00 17.665000 11.614167
    2018-09-01 08:00:00+03:00 21.016667 17.665000
    2018-09-01 09:00:00+03:00 22.589167 21.016667
    2018-09-01 10:00:00+03:00 18.605833 22.589167
    2018-09-01 11:00:00+03:00 14.022500 18.605833
    2018-09-01 12:00:00+03:00 13.150000 14.022500
    2018-09-01 13:00:00+03:00 12.806667 13.150000
    [9]:
    assert any([isinstance(df, pd.DataFrame), isinstance(df, pd.Series)])
    [10]:
    VimeoVideo("665412520", h="e03eefff07", width=600)
    [10]:

    Task 3.2.4: Add to your wrangle function so that the DatetimeIndex for df is localized to the correct timezone, "Africa/Nairobi". Don't forget to re-run all the cells above after you change the function.

    • Localize a timestamp to another timezone using pandas.
    [11]:
    # df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
    [12]:
    assert df.index.tzinfo == pytz.timezone("Africa/Nairobi")

    Explore¶

    [13]:
    VimeoVideo("665412546", h="97792cb982", width=600)
    [13]:

    Task 3.2.5: Create a boxplot of the "P2" readings in df.

    • Create a boxplot using pandas.
    [14]:
    fig, ax = plt.subplots(figsize=(15, 6))
    df["P2"].plot(kind="box", vert=False, title="Distribution of P2 Readings", ax=ax)
    [14]:
    <AxesSubplot:>
    [15]:
    VimeoVideo("665412573", h="b46049021b", width=600)
    [15]:

    Task 3.2.6: Add to your wrangle function so that all "P2" readings above 500 are dropped from the dataset. Don't forget to re-run all the cells above after you change the function.

    • Subset a DataFrame with a mask using pandas.
    [16]:
    assert len(df) <= 32906
    [17]:
    VimeoVideo("665412594", h="e56c2f6839", width=600)
    [17]:

    Task 3.2.7: Create a time series plot of the "P2" readings in df.

    • Create a line plot using pandas.
    [18]:
    fig, ax = plt.subplots(figsize=(15, 6))
    df["P2"].plot(xlabel="Time", ylabel="PM2.5", ax=ax)
    [19]:
    VimeoVideo("665412601", h="a16c5a73fc", width=600)
    [19]:

    Task 3.2.8: Add to your wrangle function to resample df to provide the mean "P2" reading for each hour. Use a forward fill to impute any missing values. Don't forget to re-run all the cells above after you change the function.

    • Resample time series data in pandas.
    • Impute missing time series values using pandas.
    [20]:
    # df["P2"].resample("1H").mean().fillna(method="ffill").to_frame() #.isnull().sum()
    [21]:
    assert len(df) <= 2928
    [22]:
    VimeoVideo("665412649", h="d2e99d2e75", width=600)
    [22]:

    Task 3.2.9: Plot the rolling average of the "P2" readings in df. Use a window size of 168 (the number of hours in a week).

    • What's a rolling window?
    • Do a rolling window calculation in pandas.
    • Make a line plot with time series data in pandas.
    [23]:
    fig, ax = plt.subplots(figsize=(15, 6))
    df["P2"].rolling(168).mean().plot(ax=ax, ylabel="PM2.5");
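To see exactly what the rolling window computes, here is a hedged toy sketch on an invented four-value series: with a window of 2, each point becomes the mean of itself and its predecessor, and the first point, which has no full window yet, becomes NaN.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
rolling_mean = s.rolling(2).mean()
print(rolling_mean.tolist())  # [nan, 1.5, 2.5, 3.5]
```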
    [24]:
    VimeoVideo("665412693", h="c3bca16aff", width=600)
    [24]:

    Task 3.2.10: Add to your wrangle function to create a column called "P2.L1" that contains the mean "P2" reading from the previous hour. Since this new feature will create NaN values in your DataFrame, be sure to also drop null rows from df.

    • Shift the index of a Series in pandas.
    • Drop rows with missing values from a DataFrame using pandas.
    [25]:
    assert len(df) <= 11686
    [26]:
    VimeoVideo("665412732", h="059e4088c5", width=600)
    [26]:

    Task 3.2.11: Create a correlation matrix for df.

    • Create a correlation matrix in pandas.
    [27]:
    df.corr()
    [27]:
    P2 P2.L1
    P2 1.000000 0.650679
    P2.L1 0.650679 1.000000
    [28]:
    VimeoVideo("665412741", h="7439cb107c", width=600)
    [28]:

    Task 3.2.12: Create a scatter plot that shows the mean PM 2.5 reading for each hour as a function of the mean reading from the previous hour. In other words, "P2.L1" should be on the x-axis, and "P2" should be on the y-axis. Don't forget to label your axes!

    • Create a scatter plot using Matplotlib.
    [29]:
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(x=df["P2.L1"], y=df["P2"])
    ax.plot([0, 120], [0, 120], linestyle="--", color="orange")
    ax.set_xlabel("P2.L1")
    ax.set_ylabel("P2")
    [29]:
    [<matplotlib.lines.Line2D at 0x7f7eae8ce940>]

    Split¶

    [30]:
    VimeoVideo("665412762", h="a5eba496f7", width=600)
    [30]:

    Task 3.2.13: Split the DataFrame df into the feature matrix X and the target vector y. Your target is "P2".

    • Subset a DataFrame by selecting one or more columns in pandas.
    • Select a Series from a DataFrame in pandas.
    [35]:
    target = "P2"
    y = df[target]
    feature = ["P2.L1"]
    X = df[feature]
    X.head()
    [35]:
    P2.L1
    timestamp
    2018-09-01 04:00:00+03:00 17.541667
    2018-09-01 05:00:00+03:00 15.800000
    2018-09-01 06:00:00+03:00 11.420000
    2018-09-01 07:00:00+03:00 11.614167
    2018-09-01 08:00:00+03:00 17.665000
    [36]:
    VimeoVideo("665412785", h="03118eda71", width=600)
    [36]:

    Task 3.2.14: Split X and y into training and test sets. The first 80% of the data should be in your training set. The remaining 20% should be in the test set.

    • Divide data into training and test sets in pandas.
    [37]:
    cutoff = int(len(X) * 0.8)
    X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
    X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]
    [38]:
    len(X_train)+len(X_test)==len(X)
    [38]:
    True
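The 80/20 cutoff pattern can be sanity-checked on a toy series (values invented):

```python
import pandas as pd

s = pd.Series(range(10))
cutoff = int(len(s) * 0.8)  # index where the training data ends
train, test = s.iloc[:cutoff], s.iloc[cutoff:]
print(len(train), len(test))  # 8 2
```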

    Build Model¶

    Baseline¶

    Task 3.2.15: Calculate the baseline mean absolute error for your model.

    • Calculate summary statistics for a DataFrame or Series in pandas.
    [45]:
    y_pred_baseline = [y_train.mean()] * len(y_train)
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
    print("Mean P2 Reading:", round(y_train.mean(), 2))
    print("Baseline MAE:", round(mae_baseline, 2))
    Mean P2 Reading: 9.27
    Baseline MAE: 3.89
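The baseline simply predicts the training mean for every observation. Here is a hedged toy example of the arithmetic with invented values:

```python
from sklearn.metrics import mean_absolute_error

y_true = [2.0, 4.0, 6.0]
# Always predict the mean of y_true, which is 4.0
y_pred_baseline = [sum(y_true) / len(y_true)] * len(y_true)
mae = mean_absolute_error(y_true, y_pred_baseline)
print(round(mae, 2))  # absolute errors are 2, 0, 2, so MAE = 4/3 ≈ 1.33
```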
    

    Iterate¶

    Task 3.2.16: Instantiate a LinearRegression model named model, and fit it to your training data.

    • Instantiate a predictor in scikit-learn.
    • Fit a model to training data in scikit-learn.
    [47]:
    model = LinearRegression()
    model.fit(X_train, y_train)
    [47]:
    LinearRegression()

    Evaluate¶

    [48]:
    VimeoVideo("665412844", h="129865775d", width=600)
    [48]:

    Task 3.2.17: Calculate the training and test mean absolute error for your model.

    • Generate predictions using a trained model in scikit-learn.
    • Calculate the mean absolute error for a list of predictions in scikit-learn.
    [54]:
    training_mae = mean_absolute_error(y_train, model.predict(X_train))
    test_mae = mean_absolute_error(y_test, model.predict(X_test))
    print("Training MAE:", round(training_mae, 2))
    print("Test MAE:", round(test_mae, 2))
    Training MAE: 2.46
    Test MAE: 1.8
    

    Communicate Results¶

    Task 3.2.18: Extract the intercept and coefficient from your model.

    • Access an object in a pipeline in scikit-learn.
    [61]:
    intercept = round(model.intercept_, 2)
    coefficient = round(model.coef_[0], 2)
    print(f"P2 = {intercept} + ({coefficient} * P2.L1)")
    P2 = 3.36 + (0.64 * P2.L1)
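To see how an intercept and coefficient are read off a fitted model, here is a hedged toy fit where the true relationship, y = 3 + 2x, is known in advance (data invented; numpy and scikit-learn are assumed available, as elsewhere in this notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 + 2.0 * X.ravel()  # exact linear relationship, no noise
model = LinearRegression().fit(X, y)
print(round(model.intercept_, 2), round(model.coef_[0], 2))  # 3.0 2.0
```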
    
    [56]:
    VimeoVideo("665412870", h="318d69683e", width=600)
    [56]:

    Task 3.2.19: Create a DataFrame df_pred_test that has two columns: "y_test" and "y_pred". The first should contain the true values for your test set, and the second should contain your model's predictions. Be sure the index of df_pred_test matches the index of y_test.

    • Create a DataFrame from a dictionary using pandas.
    [62]:
    df_pred_test = pd.DataFrame({"y_test":y_test,"y_pred":model.predict(X_test)})
    [62]:
    y_test y_pred
    timestamp
    2018-12-07 17:00:00+03:00 7.070000 8.478927
    2018-12-07 18:00:00+03:00 8.968333 7.865485
    2018-12-07 19:00:00+03:00 11.630833 9.076421
    2018-12-07 20:00:00+03:00 11.525833 10.774814
    2018-12-07 21:00:00+03:00 9.533333 10.707836
    [63]:
    VimeoVideo("665412891", h="39d7356a26", width=600)
    [63]:

    Task 3.2.20: Create a time series line plot for the values in df_pred_test using plotly express. Be sure that the y-axis is properly labeled as "P2".

    • Create a line plot using plotly express.
    [65]:
    fig = px.line(df_pred_test, labels={"value": "P2"})
    fig.show()


    Time Series: Statistical Models

    Autoregression¶

    Autoregression (AR) is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. AR works similarly to autocorrelation: in both cases, we're comparing data from one part of a series to another part. An AR model regresses a series on its own past values.

    Cleaning the Data¶

    Just like with linear regression, we'll start by bringing in some tools to help us along the way.

    [ ]:
    import warnings

    warnings.simplefilter(action="ignore", category=FutureWarning)

    Since we'll be working with the "air-quality" data again, we need to connect to the server, start our client, and grab the data we need.

    [ ]:
    client = MongoClient(host="localhost", port=27017)
    db = client["air-quality"]
    lagos = db["lagos"]

    Practice

    Just to make sure we're all on the same page, import all those libraries and get your database connection up and running. Remember that even though all the examples use the Site 3 data from the lagos collection, the practice sets should use Site 4 data from the same collection. Assign the collection to the variable lagos_prac.

    [ ]:
    lagos_prac = ...

    In order to get our data into a form we can use to build our model, we're going to need to transform it in several key ways. The first thing we need to do is to get the data we need, and save the results in a DataFrame. Since we're interested in predicting the changes in air quality over time, let's set the DataFrame's index to "timestamp":

    [ ]:
    results = lagos.find(
        {"metadata.site": 3, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )
    df = pd.DataFrame(results).set_index("timestamp")

    Practice

    Try it yourself! Create a list called results_prac that pulls data from Site 4 in the lagos data, then save it in a DataFrame called df_prac with the index "timestamp".

    [ ]:
    results_prac = ...
    df_prac = ...

    Localizing the Timezone¶

    Because MongoDB stores all timestamps in UTC, we need a way to localize them. Having timestamps in UTC might be useful if we were trying to predict some kind of global trend, but since we're only interested in what's happening with the air in Lagos, we need to convert the data from UTC to Africa/Lagos. Happily, pandas has a pair of tools to help us out: tz_localize and tz_convert. We use those methods to transform our data like this:

    [ ]:
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")
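A quick hedged check of what that pair of methods does, using a single invented timestamp (Lagos is UTC+1 year-round):

```python
import pandas as pd

idx = pd.to_datetime(["2018-09-01 00:00"])  # naive timestamp, no timezone yet
local = idx.tz_localize("UTC").tz_convert("Africa/Lagos")
print(local[0])  # 2018-09-01 01:00:00+01:00
```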

    Resampling Data¶

    The most important data in a time series model is, naturally, the data that deals with time. Our "timestamp" column tells us when each reading was taken, but to build a good predictive model, we need readings taken at regular intervals. Our raw data isn't evenly spaced, so we need to change it so that it is. The resample method does that for us.

    Let's resample our data to create 1-hour reading intervals by aggregating using the mean:

    [ ]:
    df = df["P2"].resample("1H").mean().fillna(method="ffill").to_frame()

    Notice the second half of the code:

    fillna(method="ffill").to_frame()
    

    That tells the model to forward-fill any empty cells with imputed data. Forward-filling means that each missing value is filled in from the most recent earlier cell that actually has data in it. This helps to keep the imputed data in line with the rest of the dataset.
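Here's a small sketch of the resample-then-forward-fill step on made-up readings with a multi-hour gap; `.ffill()` is the modern spelling of `fillna(method="ffill")`:

```python
import pandas as pd

# Hypothetical readings taken at irregular times, with a gap after 00:50.
idx = pd.to_datetime(
    ["2018-10-01 00:10", "2018-10-01 00:50", "2018-10-01 03:20"]
)
s = pd.Series([10.0, 14.0, 30.0], index=idx)

# Resample to 1-hour bins with the mean. Hours with no readings become NaN,
# and the forward fill copies the most recent earlier value into them.
hourly = s.resample("1H").mean().ffill()
print(hourly.tolist())  # [12.0, 12.0, 12.0, 30.0]
```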

    Adding a Lag¶

    We've spent some time elsewhere thinking about how two sets of data — apartment price and location, for example — compare to each other, but we haven't had any reason to consider how a dataset might compare to itself. If we're predicting the future, we want to know how good our prediction will be, so it might be useful to build some of that accountability into our model. To do that, we need to add a lag.

    Lagging data means adding a delayed copy of a column. In this case, each row of our dataset will also carry the reading from one hour earlier, so the model can compare its predictions with what actually happened an hour before. If the prediction and the reality are close, then it's a good model; if they aren't, then the model isn't a very good one.

    So, let's add a one-hour lag to our dataset:

    [ ]:
    df["P2.L1"] = df["P2"].shift(1)  # In `shift(1)`, the `1` is the lag interval.
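On a toy DataFrame of hypothetical readings, `shift(1)` looks like this:

```python
import pandas as pd

# Four hypothetical hourly readings.
idx = pd.date_range("2018-10-01", periods=4, freq="H")
df = pd.DataFrame({"P2": [10.0, 12.0, 11.0, 13.0]}, index=idx)

# shift(1) moves every value down one row, so each row also carries the
# reading from one hour earlier. The first row has no past, so it's NaN.
df["P2.L1"] = df["P2"].shift(1)
print(df["P2.L1"].tolist()[1:])  # [10.0, 12.0, 11.0]
```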

    Finally, let's drop our null values:

    [ ]:
    y = df["P2"].resample("1H").mean().fillna(method="ffill")

    Practice

    Try it yourself! Clean the Site 2 data from lagos, and save it as a Series called y_prac.

    [ ]:
    df_prac["P2.L1"] = ...

    Exploring the Data¶

    Time Series Line Plot¶

    Example of a time series line plot.

    Creating an ACF Plot¶

    Let's make an ACF plot using our y Series.

    [ ]:
    fig1, ax = plt.subplots(figsize=(15, 6))

    Each of the dots on our plot represents a correlation coefficient. The first data point in the top left of the graph tells us that at lag 0, the correlation coefficient is 1, meaning a perfect correlation. That makes sense: at lag 0, the series is being compared with itself, so the coefficient can't be anything other than 1. But, starting at hour 1, the coefficient drops precipitously, and we see our autocorrelation coefficients slowly decay over time. As our lag recedes further into the past, the correlations break down; a reading from five hours ago tells you a lot more about what's happening right now than a reading from 96 hours ago does.

    The light blue shape across the bottom of the graph represents the confidence interval, or the extent to which we can be sure that our estimated correlations reflect the correlations that exist in reality. By default, this is set to 95%. Data points which fall either above or below the shape are likely not due to chance, and those which fall inside the shape are likely due to chance. It looks like all our data is the result of some kind of effect, so we're good to go.

    Practice

    Try it yourself! Make an ACF plot called fig2 using your y_prac Series.

    [ ]:
    fig2, ax = ...

    Creating a PACF Plot¶

    Let's make a PACF plot using our y Series.

    [ ]:
    fig1, ax = plt.subplots(figsize=(15, 6))

    Aha! This looks very different. There are two things to notice here:

    First, we now have lots of data points that we can be relatively certain aren't due to chance, but we also have lots of data points inside the blue shape at the bottom, indicating that some of our data points are indeed due to chance. That's not necessarily a problem, but it's something useful to keep in mind.

    Second, recognize that even though the amplitude of the points on our graph has been significantly reduced, the trend has remained essentially the same: Strong positive correlations at the beginning, with the effect decaying over time. We would expect to see that, because the farther out into the future our predictions go, the less accurate they become.

    Practice

    Try it yourself! Make a PACF plot using your y_prac Series.

    [ ]:
    fig2, ax = ...

    Working with Rolling Windows¶

    The rolling window is an important concept in time series analysis. First, we define a window size, like 7 days or three months. Then we calculate a statistic over the data in each window, sliding the window sequentially through the time series. For example, suppose we want to calculate a three-month rolling sum for the time series below:

    Month      Sales
    2022-01     10
    2022-02     20
    2022-03     25
    2022-04     15
    2022-05     20
    2022-06     30

    The three-month rolling sum would be:

    Rolling months     Rolling sum of sales
    2022-01,02,03       55
    2022-02,03,04       60
    2022-03,04,05       60
    2022-04,05,06       65
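In pandas, the table above can be reproduced with the rolling method; the sales numbers are the hypothetical figures from the example:

```python
import pandas as pd

# The monthly sales figures from the table above.
sales = pd.Series(
    [10, 20, 25, 15, 20, 30],
    index=pd.period_range("2022-01", periods=6, freq="M"),
)

# rolling(window=3) slides a three-month window along the series, and
# .sum() aggregates each window. The first two results are NaN because a
# full three-month window isn't available until March.
rolling_sum = sales.rolling(window=3).sum()
print(rolling_sum.dropna().tolist())  # [55.0, 60.0, 60.0, 65.0]
```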

    Rolling window statistics are very helpful in smoothing noisy data when making time series predictions. Let's see it with an example. Since we're interested in making predictions about the air quality in Lagos, it would be helpful to understand the rolling average for the PM 2.5 readings with a line plot. To keep things manageable, we'll set our window-size to one week.

    [ ]:
    # `168` is the number of hours in a week.

    Even though there are lots of peaks and valleys here, we're starting to see an emerging trend.

    We can make the same graph using pandas, like this:

    Practice

    Try it yourself! Make a line plot that shows the weekly rolling average of the P2 values in the Site 2 dataset.

    [ ]:

    Besides the rolling sum and rolling average, the rolling-window approach can be applied to many other statistics, depending on the problem you're facing. In the example below, where we use a GARCH model to analyze stock prices, we use a rolling window to calculate the standard deviation.

    Splitting the Data in pandas¶

    The last thing to do in our data exploration is to split our data into training and test sets. For linear regression, we used an 80/20 split, where 80% of the data was our training set, and 20% of it was our test set. This time, we're going to expand the training set to 95%, and decrease the test set to 5%, to bring it into line with the statsmodels default confidence interval. This is important, because we'll need as much training data as we can get if our model is going to accurately predict what's coming next.

    [ ]:
    cutoff_test1 = int(len(y) * 0.95)
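The split itself is just integer-position slicing; a minimal sketch with a toy series standing in for y:

```python
import pandas as pd

# Toy series standing in for `y` (100 hypothetical observations).
y = pd.Series(range(100), dtype=float)

# The first 95% of observations become training data; the last 5% are test.
cutoff_test1 = int(len(y) * 0.95)
y_train = y.iloc[:cutoff_test1]
y_test = y.iloc[cutoff_test1:]
print(len(y_train), len(y_test))  # 95 5
```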

    Practice

    Try it yourself! Create a cutoff called cutoff_test2, then split the y_prac Series into training and test sets, making sure to set the cutoff to 0.95.

    [ ]:
    cutoff_test2 = ...

    Building the Model¶

    Baseline¶

    First, let's calculate the baseline MAE for our model.

    [ ]:
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
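The baseline here is simply "always predict the training mean." The notebook computes the MAE with scikit-learn's mean_absolute_error; the same number can be checked by hand on a toy series:

```python
import pandas as pd

# Four hypothetical training observations.
y_train = pd.Series([10.0, 12.0, 14.0, 16.0])

# The baseline prediction is the training mean, repeated for every row.
y_pred_baseline = [y_train.mean()] * len(y_train)

# MAE is the mean of the absolute differences between truth and prediction.
mae_baseline = (y_train - y_pred_baseline).abs().mean()
print(mae_baseline)  # 2.0
```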

    Practice

    Try it yourself! Calculate the baseline mean and MAE for the y_prac Series.

    [ ]:
    y_prac_pred_baseline = ...

    Iterating¶

    Before we can go any further, we need to instantiate an autoregression model based on our y training data. We'll call the model model.

    [ ]:
    model = AutoReg(y_train, lags=24, old_names=False).fit()

    Notice that, unlike our linear regression model which we built using scikit-learn, we're combining instantiation and fitting into one step; statsmodels includes that ability in its AutoReg method.

    Practice

    Try it yourself! Create and fit an autoregression model called model_prac.

    [ ]:
    model_prac = ...

    Autoregression models need us to generate in-sample predictions in order to calculate the MAE of our training data. In-sample predictions use data that's already part of our sample. That's to distinguish it from out-of-sample predictions, which we'll talk about a bit later. The statsmodels library includes a method called predict that can help us here. Above, the AutoReg method includes this line:

    old_names=False
    

    The old_names=False argument controls only how statsmodels names the model's parameters; setting it to False opts into the newer naming scheme and silences a deprecation warning about the old names. It has no effect on how the predictions are made.

    Here's how to generate in-sample predictions:

    [ ]:
    y_pred = model.predict().dropna()

    Once we've done that, we can calculate the MAE of the predictions in our training set.

    [ ]:
    training_mae = mean_absolute_error(y_train.loc[y_pred.index], y_pred)

    Practice

    Try it yourself! Generate in-sample predictions using y_prac, and find the MAE for your y_prac training data. Print the result.

    [ ]:
        y_prac_train.loc[y_prac_pred.index], y_prac_pred

    Residuals¶

    We're going to use our model's residuals to make some visualizations, but first, we need to calculate what those residuals are.

    [ ]:
    y_train_resid = y_train - y_pred

    Now we can make a line plot of our model's residuals.

    [ ]:
    fig1, ax = plt.subplots(figsize=(15, 6))

    The ideal residual plot has a random set of datapoints spread evenly on both sides of the line. The plot we just made actually looks pretty good; there are some significant outliers, but, on the whole, the bars describe an even band of values, which is what we're looking for.

    Practice

    Try it yourself! Calculate the residuals for y_prac and visualize them on a line plot called fig2.

    [ ]:
    y_prac_train_resid = ...

    Let's also take a look at a histogram of the residuals to help us see how they're distributed.

    [ ]:
    y_train_resid.hist();

    Remember, when we make histograms, we're trying to answer two questions:

    1.) Is it a normal distribution?
    2.) Are there any outliers?

    For our histogram, that middle bar is pretty tall, but the shape described by all the bars looks like a normal distribution (albeit a stretched one), so the answer to the first question is "yes." Outliers are values that fall beyond the shape of a normal distribution, and it doesn't look like we have any of those, so the answer to the second question is "no." Those are the answers we're looking for, so let's move on to the next step.

    ACF Plots¶

    We're going to make an ACF plot of the residuals to see whether any correlation structure is left in them.

    [ ]:
    plot_acf(y_train_resid.dropna(), ax=ax);

    At first, this might seem wrong, but we're actually looking for a mostly-flat graph here. This is an indication that our model describes all the seasonality, or regular changes, in our data. In other words, this graph is exactly what we're looking for.

    Practice

    Try it yourself! Make a histogram and an ACF plot of the y_prac data.

    [ ]:
    [ ]:

    Evaluating the Model¶

    Now that we've built an autoregression model that seems to be working pretty well, it's time to evaluate it. We've already established that the model works well when compared to itself, but what about how well it works when we start looking outside our original dataset?

    Out-of-Sample Predictions¶

    To look outside the data, we need to create a new set of predictions. The process here is very similar to the way we made our baseline predictions. We're still using predict, but we're using the test data instead of the train data.

    [ ]:
    y_pred_test = model.predict(y_test.index.min(), y_test.index.max())

    Now that we have a prediction, we can calculate the MAE of our out-of-sample data.

    [ ]:
    test_mae = mean_absolute_error(y_test, y_pred_test)

    Practice

    Try it yourself! Generate out-of-sample predictions using your y_prac data and model_prac, calculate the MAE, and print the result.

    [ ]:
        y_prac_test.index.min(), y_prac_test.index.max()

    Now that we have some out-of-sample predictions, we can compare them to our in-sample predictions using a line plot. The first step there is to create a new DataFrame called test1_predictions with two columns: one for the y_test data (the true data) and one for y_pred_test (the predicted data). It's always a good idea to print the first five rows of a new DataFrame, just to make sure it looks the way it should.

    [ ]:
        {"y_test": y_test, "y_pred": y_pred_test}, index=y_test.index

    That looks correct, so we can move on to our line plot.

    [ ]:
    fig = px.line(test1_predictions, labels={"value": "P2"})

    This looks kind of strange, but it's actually exactly what we would expect to see. At the beginning, the y_pred data has a fair amount of predictive power, but, as time goes on, the predictions become less and less accurate. It's a lot like what happened with our ACF plots: there, the correlations broke down as the lag increased; here, the model loses its predictive power as the horizon — how far away from the present your predictions are — increases. But don't worry! We'll fix it in a second.

    Practice

    In the meantime, try it yourself! Make a DataFrame with columns for y_prac_test and y_prac_pred, and print the result. Then, make a line plot that shows the relationship between the two variables.

    [ ]:
    [ ]:

    Walk-forward Validation¶

    Our predictions lose power over time because the model gets farther and farther away from its beginning. But what if we could move that beginning forward with the model? That's the idea behind walk-forward validation. In a walk-forward validation, we re-train the model for each new observation in the dataset, dropping the data that's the farthest in the past. Let's say that our prediction for what's going to happen at 12:00 is based on what happened at 11:00, 10:00, and 9:00. When we move forward an hour to predict what's going to happen at 1:00, we only use data from 10:00, 11:00, and 12:00, dropping the data from 9:00 because it's now too far in the past. Let's see how it works.

    [ ]:
    # Then, we define a variable that takes into account what's happened in the past
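The walk-forward loop can be sketched with toy data. Note that this sketch swaps in a naive "repeat the last observed value" forecaster in place of the notebook's AutoReg model, so only the walk-forward mechanics are shown:

```python
import pandas as pd

# Toy train/test split (hypothetical hourly readings).
y_train = pd.Series([10.0, 12.0, 11.0], index=[0, 1, 2])
y_test = pd.Series([13.0, 14.0], index=[3, 4])

history = y_train.copy()
predictions = []
for idx in y_test.index:
    # "Fit" on history and forecast one step ahead. A real implementation
    # would refit AutoReg here; the naive forecast just repeats the last value.
    next_pred = history.iloc[-1]
    predictions.append(next_pred)
    # Walk forward: fold the true observation into history for the next step.
    history = pd.concat([history, y_test.loc[[idx]]])

y_pred_wfv = pd.Series(predictions, index=y_test.index)
print(y_pred_wfv.tolist())  # [11.0, 13.0]
```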

    You'll notice that we're using the same AutoReg method we used before, only this time we re-fit it on a history that starts with y_train and grows with each new observation. Also like before, the 24 is telling the model how many lagged hours it should pay attention to. If you change that number, the MAE will change too.

    Speaking of the MAE, let's calculate it and see what we've got.

    [ ]:
    print("Test MAE 1 (walk forward validation):", round(test1_mae, 2))

    Practice

    Try it yourself! Perform a walk-forward validation of your model using the y_prac_train data. Then, calculate the MAE and print the result. Note that because we're using %%capture in the validation cell, you'll need to create a new cell for your MAE calculation.

    [ ]:
    y_prac_pred_wfv = ...
    [ ]:
    test2_mae = ...

    Communicating the Results¶

    In machine learning, the model's parameters are the parts of the model that are learned from the training data. There are also hyperparameters, which we'll discuss in the next module. For now, just know that parameters come from inside the model, and hyperparameters are specified outside the model.

    So, let's print the parameters of our validated model and see what it looks like.

    [ ]:
    print(model.params)

    That looks pretty good, but showing it in a line plot would be much better.

    [ ]:
        {"y_test": y_test, "y_pred": y_pred_wfv}, index=y_test.index

    That looks much better! Now our predictions are actually tracking the test data, just like they did in the linear regression model.

    Practice

    Try it yourself! Access the parameters of model_prac, put y_prac_test and y_prac_pred_wfv into the test2_predictions DataFrame, and create a line plot using plotly express.

    [ ]:
    [ ]:

    ARMA Models & Hyperparameters¶

    ARMA stands for Auto Regressive Moving Average, and it's a special kind of time-series analysis. So far, we've used autoregression (AR) to build our time-series models, and you might recall that AR models rely on values that remain relatively stable over time. That is, they can predict the future very well, as long as the future looks roughly the same as the past. The trouble with predicting the future is that things can suddenly change, and as a result, the future doesn't look much like the past anymore. These sudden changes — economists call them exogenous shocks — can be as big as a hurricane destroying a city or an unexpected increase in the minimum wage, and they can be as small as a new restaurant opening in a neighborhood or a single person losing their job. In our data, the air quality might be changed if there was a nearby forest fire, or if a building collapsed near one of the sensors and raised a giant cloud of dust.

    Regardless of the size of the shock, ARMA models can still predict the future. All we need to make that work is data.

    Cleaning the Data¶

    As always, we need to import all the tools we'll need to make our model.

    [ ]:
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    And then we need to get our database client up and running.

    [ ]:
    client = MongoClient(host="localhost", port=27017)

    Then, we need to clean our data. All the examples will use data from Site 3; all the practice sets will use Site 2. If you need a refresher on how all those methods work, refer back to the Autoregression notebook.

    [ ]:
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")

    Practice

    Try it yourself! Get your client up and running and call your database db_prac. Create a variable called results_prac, and read in a collection called lagos_prac using data from Site 2. Save it as a Series called y_prac.

    [ ]:
    df_prac["P2.L1"] = ...

    Exploring the Data¶

    Just like we did with AR, we'll start by exploring the data. Let's make a histogram.

    [ ]:
    y.hist();

    Practice

    Try it yourself! Make a histogram using y_prac.

    [ ]:

    This is what the data looks like when our sample is 1-hour intervals, but we might want to be able to quickly change our sample to other intervals of time. First, we'll create a function called wrangle, and then add an argument. In Python, arguments are the inputs we pass to a function to control what it does. This function already has an argument called collection, so we'll need to add another to make resampling work. We'll call that argument resamp_pd.

    [ ]:
    # Here's where the new argument goes. We're setting the default value to `"1H"`.
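Stripped of the MongoDB query and timezone steps, the pattern looks like this; readings is a hypothetical Series standing in for the data pulled from the collection:

```python
import pandas as pd

# resamp_pd gets a default of "1H", so existing calls keep working
# while new calls can pass a different resampling period.
def wrangle(readings, resamp_pd="1H"):
    return readings.resample(resamp_pd).mean().ffill()

# Hypothetical readings spanning a bit more than a day.
idx = pd.to_datetime(
    ["2018-10-01 00:15", "2018-10-01 00:45", "2018-10-02 06:00"]
)
readings = pd.Series([10.0, 14.0, 20.0], index=idx)

print(len(wrangle(readings)))                  # hourly bins
print(len(wrangle(readings, resamp_pd="1D")))  # 2 daily bins
```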

    Now let's change "1H" to "1D" and see what happens.

    [ ]:
    y = wrangle(lagos, resamp_pd="1D")

    As you can see on the left side of the table, the samples are now at one day intervals, which is exactly what we wanted!

    Let's make a new histogram to see if changing the sampling interval made a difference in the data.

    [ ]:
    y.hist();

    This looks pretty different! Averaging over whole days smooths out the hour-to-hour noise, so the distribution changes shape.

    Practice

    Try it yourself! Define a function called wrangle_prac, run it, and print the results of y_prac. Then, create a new histogram from y_prac.

    [ ]:
    print(y_prac)
    [ ]:

    Like with our AR model, we need to create ACF and PACF plots to see what's happening with the correlation coefficients.

    [ ]:
    fig, ax = plt.subplots(figsize=(15, 6))

    And now let's make a PACF plot.

    [ ]:
    fig, ax = plt.subplots(figsize=(15, 6))

    Practice

    Try it yourself! Make an ACF and a PACF plot using your y_prac data.

    [ ]:
    [ ]:

    Splitting the Data¶

    In our AR model, we split our data based on the number of observations we wanted to investigate. This time, we're going to split our data based on the date, using just the readings from October 2018. So, just like we did before, we'll create a training set using y, but instead of using percentages to split the data, we'll use dates.

    [ ]:
    # Notice that the date format is `YYYY-MM-DD`
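With a DatetimeIndex, pandas lets .loc slice by date strings, and both endpoints are included. A toy sketch with hypothetical daily values:

```python
import pandas as pd

# Hypothetical daily series spanning September through mid-November 2018.
idx = pd.date_range("2018-09-15", "2018-11-15", freq="D")
y = pd.Series(range(len(idx)), index=idx, dtype=float)

# Slice by date label; the format is YYYY-MM-DD and the ends are inclusive,
# so this keeps exactly the 31 days of October 2018.
y_train = y.loc["2018-10-01":"2018-10-31"]
print(len(y_train))  # 31
```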

    Practice

    Try it yourself! Create a training dataset called y_prac_train based on November 2017. (Hint: there are 30 days in November.)

    [ ]:
    y_prac_train = ...

    Building the Model¶

    Baseline¶

    The first thing we need to do is calculate the MAE for our new model:

    [ ]:
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)

    Practice

    Try it yourself! Calculate the mean and MAE for the y_prac Series, and print the results.

    [ ]:
    y_prac_pred_baseline = ...

    Iterating¶

    So far, the only difference between our old AR model and the new ARMA model we're building is that the new model's data is based on the date rather than on the length of the variable. But the difference between AR and ARMA is the addition of hyperparameters.

    Hyperparameters¶

    Let's set our p values to include values from 0 to 25, moving in steps of 8:

    [ ]:
    p_params = range(0, 25, 8)

    And let's set our q values to include values from 0 to 3, moving in steps of 1:

    [ ]:
    q_params = range(0, 3)

    Practice

    Using p_params_prac, set the p value to include values from 1 to 4, moving in steps of 1. Then, using q_params_prac, set the q value to include values from 0 to 3, moving in steps of 1.

    [ ]:
    p_params_prac = ...

    In order to tell the model to keep going through all the possible combinations, we'll add in a pair of for loops. (If you need a refresher on for loops, refer to Notebook 001.)

    [ ]:
            # Here's where we tell the model how we want it to deal with time
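The shape of that nested loop can be sketched as follows; fit_and_score is a hypothetical placeholder for "fit an ARMA model with order (p, 0, q) and return its MAE," which keeps the sketch runnable without statsmodels:

```python
# The hyperparameter grid from above.
p_params = range(0, 25, 8)  # 0, 8, 16, 24
q_params = range(0, 3)      # 0, 1, 2

# Hypothetical placeholder: in the notebook, this step fits an ARMA model
# and returns its MAE. A dummy score stands in here.
def fit_and_score(p, q):
    return float(p + q)

maes = {}
for p in p_params:
    maes[p] = []              # one list per p value...
    for q in q_params:        # ...with one MAE per q value
        maes[p].append(fit_and_score(p, q))

print(maes[0])  # [0.0, 1.0, 2.0]
```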

    Practice

    Try it yourself! Create an ARMA model called model_prac2 based on a dictionary called maes_prac, using your training and test data, then print the results and append the MAE to maes_prac.

    [ ]:

    Now that we have a working ARMA model, let's turn the output into a DataFrame so we can see what happened.

    [ ]:
    mae_grid = pd.DataFrame(maes)

    And let's visualize the DataFrame in a heatmap. (If you need a refresher on how to create a heatmap in seaborn, refer to Notebook 008.)

    [ ]:
    plt.title("Grid Search (Criterion: MAE)");

    Practice

    Try it yourself! Read the output of your ARMA model into a DataFrame called mae_grid_prac, and visualize it in a heatmap.

    [ ]:
    mae_grid_prac = ...

    It looks like our MAE values are in the right place, but let's try some other ways to explore our new model using the plot_diagnostics method.

    [ ]:
    fig, ax = plt.subplots(figsize=(15, 12))

    As usual, we have quite a lot to sift through here. The first graph is showing us our model's residuals. Ideally, we'd like to see this be as close to zero as possible, and this graph is telling us that, for the most part, we have a good model.

    The next graph over shows us another version of the same thing. The histogram is similar to the one we made before, but there's a pair of lines superimposed. These lines show the kernel density estimate, which is a smoothed-out version of the blue histogram bars. The green line represents a normal distribution, and is included here just to give you something to compare with the values from your model. The orange line represents the smoothed-out version of the results of your model. Our model is actually pretty close to a normal distribution, so that's good!

    The Q-Q plot on the bottom left is yet another way to visualize the same thing. Here, the red line is showing us a perfect 1:1 correlation between our variables, and the wavy blue line is showing us what we actually have. Again, our model is pretty close to the red line, so it's looking good.

    And finally, we have a correlogram, which might look familiar; it's the same kind of plot as the ACF and PACF plots from our AR models.

    Practice

    Try it yourself! Use plot_diagnostics to examine the residuals from model_prac.

    [ ]:

    Communicating the Results¶

    Now that we have an ARMA model that seems to be working well, it's time to communicate the results of our analysis in a line graph. Let's create a graph that shows the relationship between our training and predicting data. (For a refresher on how to do this and what it means, refer to the AR notebook.)

    [ ]:
        {"y_train": y_train, "y_pred": y_train_pred}, index=y_train.index

    Practice

    Try it yourself! Create a line plot that compares the y_prac_train and y_prac_pred values.

    [ ]:

    ARCH and GARCH Models¶

    ARCH stands for autoregressive conditionally heteroscedastic; it's a family of models for the variance of a time series. An ARCH model assumes that the variance at time t depends on past squared observations. A GARCH (generalized autoregressive conditionally heteroscedastic) model uses both past squared observations and past variances to model the variance at time t. ARCH and GARCH models are widely used in finance and econometric time series analysis. For more details, check out these YouTube videos.

    [ ]:
    YouTubeVideo("Li95a2biFCU")
    [ ]:
    YouTubeVideo("inoBpq1UEn4")

    Let's see an example of using a GARCH model to forecast Apple stock price volatility. We first import the data.

    [ ]:
    df = pd.read_csv("./data/AAPL.csv", parse_dates=["date"], index_col="date")

    The dataset ranges from 1999 to 2022. For simplicity, we only take 5 years of data.

    [ ]:
    df = df["2015-10-13":]

    Calculating Returns¶

    The next step is to calculate the daily returns of the closing prices, from which we'll measure volatility. Here we can use the pct_change function from pandas to calculate the daily percentage change, then multiply the result by 100 to get the returns:

    [ ]:
    df["return"] = df["close"].pct_change() * 100
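On a toy price series of hypothetical closes, the calculation looks like this:

```python
import pandas as pd

# Three hypothetical closing prices.
close = pd.Series([100.0, 102.0, 99.96])

# pct_change computes (price_t - price_{t-1}) / price_{t-1}; multiplying
# by 100 expresses each daily change as a percentage return. The first
# entry is NaN because there is no prior price.
returns = close.pct_change() * 100
print(returns.round(2).tolist()[1:])  # [2.0, -2.0]
```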

    We can then take the return column as our training data. Note that the first observation is missing, since there is no past value to compare against, so we need to drop it:

    [ ]:
    AAPL_return = df["return"].dropna()

    We can check the histogram to see the distribution of the returns over the past five years:

    [ ]:
    plt.title("Distribution of AAPL Daily Returns");

    There's a negative outlier in this date range. Using the idxmin function, we find that it occurred on 31 August 2020, when the stock price dropped sharply because of a stock split.

    [ ]:
    AAPL_return.idxmin(), AAPL_return.min()

    We can also check the standard deviation of the whole dataset with the pandas std() function:

    [ ]:
    AAPL_return.std()

    To see these statistics more clearly, we can use some time series plots. In addition to plotting the Apple stock returns themselves, we can plot the rolling volatility of the returns to smooth out the noise and see how return and volatility are associated with each other. First, we calculate a 50-day rolling standard deviation series for the Apple returns:

    [ ]:
    AAPL_return_rolling_50d_volatility = AAPL_return.rolling(window=50).std().dropna()

    We can plot these two series:

    [ ]:
        ax=ax, label="50d rolling volatility", linewidth=3

    Here we can see that volatility goes up when the returns change drastically, either up or down. For instance, when the return dropped enormously in August 2020, volatility also spiked. However, this plot reveals a problem. We want to see whether high volatility on one day is associated with high volatility on the following day. But high volatility is caused by large changes in returns, which can be either positive or negative. How can we assess negative and positive numbers together without them canceling each other out? The common solution in building ARCH or GARCH models is to square the returns. Let's plot the squared returns:

    [ ]:
    (AAPL_return**2).plot(ax=ax, label="daily return")

    Now it's much easier to see clusters of high and low volatility, and we can build a GARCH model. To do that, we need to figure out both the p and q parameters. The p parameter handles the past squared shocks at prior time steps, and the q parameter deals with the past variances. It also uses the notion of lag. To see how many lags we should have in our model, we should create an ACF and a PACF plot — but using the squared returns.

    [ ]:
    fig, ax = plt.subplots(figsize=(15, 6))
    [ ]:
    fig, ax = plt.subplots(figsize=(15, 6))

    Both the ACF and PACF graphs show that one lag is enough to build the model.

    Building the Model¶

    [ ]:
    model = arch_model(AAPL_return, p=1, q=1, rescale=False).fit(disp=0)

    Common Metrics¶

    The model summary provides a lot of information about the model, like the trained parameters and measures of model performance. AIC and BIC are two important performance measures. AIC stands for Akaike Information Criterion; it measures the goodness of fit of an estimated statistical model. BIC is the Bayesian Information Criterion, which is used to select a model from a finite set of candidates. When fitting models, it's possible to improve the fit by adding more parameters, but that can also result in overfitting. Both criteria address this problem with a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC, so BIC is typically higher than AIC. Either way, the lower the value of these two measures, the better the model.

    Standardized Residuals¶

    After fitting the model with the data, we can also get the residuals to see how the model performs. The residual at time t is the observed return at time t minus the model's estimated mean return at time t. For models that model observations directly, like stock prices, the assumption is that the residuals are noise following a normal distribution. In a GARCH model, since we are modeling the variance of the returns, the raw residuals will not follow a normal distribution, so we need to use standardized residuals to check normality. The standardized residual at time t is the residual at time t divided by the square root of the model's estimated variance at time t. Let's plot the standardized residuals across the time series:

    [ ]:
    model.std_resid.plot(ax=ax, label="Standardized Residuals")

    The standardized residuals move around 0, except for the outlier event from August 2020. Let's check normality with a histogram:

    [ ]:
    plt.title("Distribution of Standardized Residuals");

    If we exclude the outlier, the other standardized residuals do follow a normal distribution with mean 0.

    Evaluation¶

    We can further evaluate the model by comparing its forecast with a subset of the observed returns to see whether the model has successfully captured the volatility. We can first plot the model's conditional volatility band across the whole dataset:

    [ ]:
    (-2 * model.conditional_volatility.rename("")).plot(ax=ax, color="C1")

    We can then select the test set of the data and do a walk-forward validation on the GARCH model:

    [ ]:
        next_pred = model.forecast(horizon=1, reindex=False).variance.values[0][0] ** 0.5

    We can plot the predicted volatility versus the returns in this test set, and see that the model has captured most of the variance here:

    [ ]:
    (2 * y_test_wfv).plot(ax=ax, c="C1", label="2 SD Predicted Volatility")

    Forecasting¶

    The last step is to make the forecasts with our trained model. Let's make a one-day forecast first to see how GARCH model forecasting works:

    [ ]:
    one_day_forecast = model.forecast(horizon=1, reindex=False).variance

    There are two things we need to keep in mind here. First, our model forecasts the predicted variance, not the standard deviation (volatility), so we'll need to take the square root of the value. Second, the prediction's index is in the format "h.1", meaning one step after the last date in the training data; it's not the DatetimeIndex we want, so we need to reformat the forecasts to be more readable.

    We first generate a 5-day prediction with our trained GARCH model:

    [ ]:
    prediction_dates = pd.bdate_range(start=start, periods=prediction.shape[1])

    Then we define a function that cleans the predictions into a readable Series:

    [ ]:
    prediction_formatted = pd.Series(data, index=prediction_index)
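Putting the two fixes together, a cleaning function might look like the sketch below. The input mimics the one-row, "h.1"-style variance frame that arch's forecast returns; the names (clean_prediction, pred) and the dates are illustrative, not the lesson's exact code.

```python
import pandas as pd

def clean_prediction(prediction):
    """Convert a one-row variance forecast into a dated volatility Series."""
    # Business days starting the day after the last training observation
    start = prediction.index[0] + pd.DateOffset(days=1)
    prediction_dates = pd.bdate_range(start=start, periods=prediction.shape[1])
    prediction_index = [d.isoformat() for d in prediction_dates]
    # Square root turns the predicted variance into volatility
    data = prediction.values.flatten() ** 0.5
    return pd.Series(data, index=prediction_index)

# Toy 5-day variance forecast shaped like `model.forecast(horizon=5).variance`
pred = pd.DataFrame(
    [[4.0, 9.0, 16.0, 25.0, 36.0]],
    columns=["h.1", "h.2", "h.3", "h.4", "h.5"],
    index=[pd.Timestamp("2021-10-01")],
)
cleaned = clean_prediction(pred)
print(cleaned)
```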

    References & Further Reading¶

    • More on ARMA models
    • Even more on ARMA models
    • Information on p and q parameters
    • A primer on autoregression
    • More information on ACF plots
    • A primer on ACF and PACF
    • Background on residuals
    • More on walk-forward validation
    • Reading on parameters and hyperparameters

    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


    3.3. Autoregressive Models

    [1]:
     
    import warnings
    ​
    import matplotlib.pyplot as plt
    import pandas as pd
    import plotly.express as px
    from IPython.display import VimeoVideo
    from pymongo import MongoClient
    from sklearn.metrics import mean_absolute_error
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    from statsmodels.tsa.ar_model import AutoReg
    ​
    warnings.simplefilter(action="ignore", category=FutureWarning)
    /opt/conda/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:7: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
      from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,
    /opt/conda/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:7: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
      from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,
    
    [2]:
     
    VimeoVideo("665851858", h="e39fc3d260", width=600)
    [2]:

    1. Prepare Data¶

    1.1. Import¶

    [3]:
     
    VimeoVideo("665851852", h="16aa0a56e6", width=600)
    [3]:

    Task 3.3.1: Complete the code to create a client that connects to the MongoDB server, assign the "air-quality" database to db, and assign the "nairobi" collection to nairobi.

    • Create a client object for a MongoDB instance.
    • Access a database using PyMongo.
    • Access a collection in a database using PyMongo.
    [4]:
     
    client = MongoClient(host="localhost", port=27017)
    db = client["air-quality"]
    ​
    nairobi = db["nairobi"]
    [5]:
     
    VimeoVideo("665851840", h="e048425f49", width=600)
    [5]:

    Task 3.3.2: Change the wrangle function below so that it returns a Series of the resampled data instead of a DataFrame.

    [6]:
     
    def wrangle(collection):
        results = collection.find(
            {"metadata.site": 29, "metadata.measurement": "P2"},
            projection={"P2": 1, "timestamp": 1, "_id": 0},
        )
    ​
        # Read data into DataFrame
        df = pd.DataFrame(list(results)).set_index("timestamp")
    ​
        # Localize timezone
        df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
    ​
        # Remove outliers
        df = df[df["P2"] < 500]
    ​
        # Resample to 1hr window
        y = df["P2"].resample("1H").mean().fillna(method='ffill')
    ​
        return y
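To see what the resample-then-forward-fill step does, here is a small stand-alone example with made-up timestamps (using .ffill(), the newer spelling of fillna(method="ffill")):

```python
import pandas as pd

readings = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(
        ["2018-09-01 00:10", "2018-09-01 00:50", "2018-09-01 03:20"]
    ),
)

# Hourly means; the empty 01:00 and 02:00 windows become NaN,
# then forward-fill carries the last observed mean forward
hourly = readings.resample("1H").mean().ffill()
print(hourly)
```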

    Task 3.3.3: Use your wrangle function to read the data from the nairobi collection into the Series y.

    [7]:
     
    y = wrangle(nairobi)
    y.head()
    [7]:
    timestamp
    2018-09-01 03:00:00+03:00    17.541667
    2018-09-01 04:00:00+03:00    15.800000
    2018-09-01 05:00:00+03:00    11.420000
    2018-09-01 06:00:00+03:00    11.614167
    2018-09-01 07:00:00+03:00    17.665000
    Freq: H, Name: P2, dtype: float64
    [8]:
     
    # Check your work
    assert isinstance(y, pd.Series), f"`y` should be a Series, not type {type(y)}"
    assert len(y) == 2928, f"`y` should have 2928 observations, not {len(y)}"
    assert y.isnull().sum() == 0

    1.2. Explore¶

    [9]:
     
    VimeoVideo("665851830", h="85f58bc92b", width=600)
    [9]:
    [10]:
     
    #y.corr(y)  # correlation of y with itself is 1
    #y.corr(y.shift(1))  # correlation between y and its one-hour lag is 0.65
    #y.corr(y.shift(2))  # correlation between y and its two-hour lag is 0.4
    #y.corr(y.shift(3))  # correlation between y and its three-hour lag is 0.3
    # we can find the same correlations using an ACF plot
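The shift-and-correlate trick is easy to verify on a toy Series: shifting by one position lines each value up with its previous value before computing the Pearson correlation (a perfectly linear series is perfectly correlated with its own lag):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# s.shift(1) is [NaN, 1, 2, 3, 4]; corr() drops the NaN pair
lag1_corr = s.corr(s.shift(1))
print(lag1_corr)
```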

    Task 3.3.4: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient".

    • What's an ACF plot?
    • Create an ACF plot using statsmodels
    [11]:
     
    fig, ax = plt.subplots(figsize=(15, 6))
    plot_acf(y,ax=ax)
    plt.xlabel("Lag [hours]")
    plt.ylabel("Correlation Coefficient");
    [12]:
     
    VimeoVideo("665851811", h="ee3a2b5c24", width=600)
    [12]:

    Task 3.3.5: Create a PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient".

    • What's a PACF plot?
    • Create a PACF plot using statsmodels
    [13]:
     
    fig, ax = plt.subplots(figsize=(15, 6))
    plot_pacf(y,ax=ax)
    plt.xlabel("Lag [hours]")
    plt.ylabel("Correlation Coefficient");

    1.3. Split¶

    [14]:
     
    VimeoVideo("665851798", h="6c191cd94c", width=600)
    [14]:

    Task 3.3.6: Split y into training and test sets. The first 95% of the data should be in your training set. The remaining 5% should be in the test set.

    • Divide data into training and test sets in pandas.
    [15]:
     
    cutoff_test = int(len(y)*0.95)
    ​
    y_train = y.iloc[:cutoff_test]
    y_test = y.iloc[cutoff_test:]
    [16]:
     
    len(y_train)+len(y_test)==len(y)
    [16]:
    True

    2. Build Model¶

    2.1. Baseline¶

    Task 3.3.7: Calculate the baseline mean absolute error for your model.

    • Calculate summary statistics for a DataFrame or Series in pandas.
    [17]:
     
    y_train_mean = y_train.mean()
    y_pred_baseline = [y_train_mean] * len(y_train)
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
    ​
    print("Mean P2 Reading:", round(y_train_mean, 2))
    print("Baseline MAE:", round(mae_baseline, 2))
    Mean P2 Reading: 9.22
    Baseline MAE: 3.71
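The baseline MAE is just the average absolute distance from the training mean, which you can confirm by hand on a tiny example (numbers invented):

```python
y_small = [1.0, 2.0, 3.0, 6.0]              # toy "training" readings
y_mean = sum(y_small) / len(y_small)         # 3.0, the naive prediction

# Mean absolute error of always predicting the mean
mae_baseline = sum(abs(v - y_mean) for v in y_small) / len(y_small)
print(mae_baseline)  # (2 + 1 + 0 + 3) / 4 = 1.5
```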
    

    2.2. Iterate¶

    [18]:
     
    VimeoVideo("665851769", h="94a4296cde", width=600)
    [18]:

    Task 3.3.8: Instantiate an AutoReg model and fit it to the training data y_train. Be sure to set the lags argument to 26.

    • What's an AR model?
    • Instantiate a predictor in statsmodels.
    • Train a model in statsmodels.
    [19]:
     
    model = AutoReg(y_train,lags=26).fit()
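Conceptually, AutoReg fits a linear regression of each observation on its own lagged values. A numpy-only sketch of the AR(1) case on simulated data (not the Nairobi readings) shows that least squares recovers the lag coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y_t = 0.6 * y_{t-1} + noise
phi_true, n = 0.6, 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + rng.normal()

# Least-squares fit of y_t against y_{t-1} (no intercept for simplicity)
X, target = y[:-1], y[1:]
phi_hat = (X @ target) / (X @ X)
print(phi_hat)  # close to 0.6
```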
    [20]:
     
    VimeoVideo("665851746", h="1a4511e883", width=600)
    [20]:

    Task 3.3.9: Generate a list of training predictions for your model and use them to calculate your training mean absolute error.

    • Generate in-sample predictions for a model in statsmodels.
    • Calculate the mean absolute error for a list of predictions in scikit-learn.
    [21]:
     
    y_pred = model.predict().dropna()
    training_mae = mean_absolute_error(y_train.iloc[26:],y_pred)
    print("Training MAE:", training_mae)
    Training MAE: 2.2809871656467036
    
    [22]:
     
    VimeoVideo("665851744", h="60d053b455", width=600)
    [22]:

    Task 3.3.10: Use y_train and y_pred to calculate the residuals for your model.

    • What's a residual?
    • Create new columns derived from existing columns in a DataFrame using pandas.
    [23]:
     
    #y_train_resid = y_train - y_pred  # gives the same result as the line below
    y_train_resid = model.resid
    y_train_resid.tail()
    [23]:
    timestamp
    2018-12-25 19:00:00+03:00   -0.392002
    2018-12-25 20:00:00+03:00   -1.573180
    2018-12-25 21:00:00+03:00   -0.735747
    2018-12-25 22:00:00+03:00   -2.022221
    2018-12-25 23:00:00+03:00   -0.061916
    Freq: H, dtype: float64
    [24]:
     
    VimeoVideo("665851712", h="9ff0cdba9c", width=600)
    [24]:

    Task 3.3.11: Create a plot of y_train_resid.

    • Create a line plot using pandas.
    [25]:
     
    fig, ax = plt.subplots(figsize=(15, 6))
    y_train_resid.plot(ylabel="Residual Value",ax=ax)
    [25]:
    <AxesSubplot:xlabel='timestamp', ylabel='Residual Value'>
    [26]:
     
    VimeoVideo("665851702", h="b494adc297", width=600)
    [26]:

    Task 3.3.12: Create a histogram of y_train_resid.

    • Create a histogram using plotly express.
    [27]:
     
    y_train_resid.hist()
    plt.xlabel("Residual Value")
    plt.ylabel("Frequency")
    plt.title("AR(26),Distribution of Residuals")
    [27]:
    Text(0.5, 1.0, 'AR(26),Distribution of Residuals')
    [28]:
     
    VimeoVideo("665851684", h="d6d782a1f3", width=600)
    [28]:

    Task 3.3.13: Create an ACF plot of y_train_resid.

    • What's an ACF plot?
    • Create an ACF plot using statsmodels
    [29]:
     
    fig, ax = plt.subplots(figsize=(15, 6))
    plot_acf(y_train_resid,ax=ax)
    [29]:

    2.3. Evaluate¶

    [30]:
     
    VimeoVideo("665851662", h="72e767e121", width=600)
    [30]:
    [31]:
     
    # # Method shown in the video: predict over the test range and check the MAE
    # y_test.index.min()  # first timestamp in the test data
    # y_test.index.max()  # last timestamp in the test data
    # y_pred_test = model.predict(y_test.index.min(), y_test.index.max()).dropna()
    # test_mae = mean_absolute_error(y_test, y_pred_test)
    # print("Test MAE:", test_mae)
    [32]:
     
    # Method used here: refit an AR model on the test data and score its in-sample predictions
    model = AutoReg(y_test, lags=26).fit()

    Task 3.3.14: Calculate the test mean absolute error for your model.

    • Generate out-of-sample predictions using model in statsmodels.
    • Calculate the mean absolute error for a list of predictions in scikit-learn.
    [33]:
     
    y_pred_test = model.predict().dropna()
    test_mae = mean_absolute_error(y_test.iloc[26:],y_pred_test)
    print("Test MAE:", test_mae)
    Test MAE: 1.1581601912057071
    

    Task 3.3.15: Create a DataFrame test_predictions that has two columns: "y_test" and "y_pred". The first should contain the true values for your test set, and the second should contain your model's predictions. Be sure the index of test_predictions matches the index of y_test.

    • Create a DataFrame from a dictionary using pandas.
    [34]:
     
    df_pred_test = pd.DataFrame(
        {"y_test": y_test, "y_pred": y_pred_test}, index=y_test.index
    )
    [35]:
     
    VimeoVideo("665851628", h="29b43e482e", width=600)
    [35]:

    Task 3.3.16: Create a time series plot for the values in test_predictions using plotly express. Be sure that the y-axis is properly labeled as "P2".

    • Create a line plot in plotly express.
    [36]:
     
    fig = px.line(df_pred_test, labels={"value": "P2"})
    fig.show()
    [Plotly line chart: y_test and y_pred, Dec 26, 2018 to Jan 1, 2019; y-axis "P2"]
    [37]:
     
    VimeoVideo("665851599", h="bb30d96e43", width=600)
    [37]:

    Task 3.3.17: Perform walk-forward validation for your model for the entire test set y_test. Store your model's predictions in the Series y_pred_wfv.

    • What's walk-forward validation?
    • Perform walk-forward validation for time series model.
    [38]:
     
    %%capture
    ​
    y_pred_wfv = pd.Series()
    history = y_train.copy()
    for i in range(len(y_test)):
        model = AutoReg(history,lags=26).fit()
        next_pred = model.forecast()
        y_pred_wfv = y_pred_wfv.append(next_pred)
        history = history.append(y_test[next_pred.index])
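The same walk-forward pattern works with any one-step forecaster. Here is a stripped-down sketch on toy numbers, using a naive last-value forecast in place of AutoReg and pd.concat in place of the now-deprecated Series.append:

```python
import pandas as pd

y_train_toy = pd.Series([1.0, 2.0, 3.0])
y_test_toy = pd.Series([4.0, 5.0, 6.0], index=[3, 4, 5])

history = y_train_toy.copy()
preds = []
for idx, actual in y_test_toy.items():
    next_pred = history.iloc[-1]          # stand-in for model.forecast()
    preds.append(next_pred)
    # Append the *observed* value so the next step trains on real data
    history = pd.concat([history, pd.Series([actual], index=[idx])])

y_pred_toy = pd.Series(preds, index=y_test_toy.index)
print(y_pred_toy)
```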
        
    ​
    [39]:
     
    len(y_pred_wfv)==len(y_test)
    [39]:
    True
    [40]:
     
    history.tail(1)
    [40]:
    2019-01-01 02:00:00+03:00    18.803333
    Freq: H, Name: P2, dtype: float64
    [41]:
     
    model = AutoReg(history,lags=26).fit()
    [42]:
     
    model.forecast()
    [42]:
    2019-01-01 03:00:00+03:00    14.110356
    Freq: H, dtype: float64
    [43]:
     
    y_test.head(1)
    [43]:
    timestamp
    2018-12-26 00:00:00+03:00    5.679091
    Freq: H, Name: P2, dtype: float64
    [44]:
     
    VimeoVideo("665851568", h="a764ab5416", width=600)
    [44]:

    Task 3.3.18: Calculate the test mean absolute error for your model.

    • Calculate the mean absolute error for a list of predictions in scikit-learn.
    [45]:
     
    test_mae = mean_absolute_error(y_test,y_pred_wfv)
    print("Test MAE (walk forward validation):", round(test_mae, 2))
    Test MAE (walk forward validation): 1.4
    

    3. Communicate Results¶

    [46]:
     
    VimeoVideo("665851553", h="46338036cc", width=600)
    [46]:

    Task 3.3.19: Print out the parameters for your trained model.

    • Access model parameters in statsmodels
    [47]:
     
    # Unlike scikit-learn, statsmodels makes the intercept and coefficients easy to inspect:
    print(model.params)
    intercept    2.021763
    P2.L1        0.587991
    P2.L2        0.019749
    P2.L3        0.023426
    P2.L4        0.026807
    P2.L5        0.044019
    P2.L6       -0.101889
    P2.L7        0.029814
    P2.L8        0.049894
    P2.L9       -0.017513
    P2.L10       0.032585
    P2.L11       0.064034
    P2.L12       0.005978
    P2.L13       0.018366
    P2.L14      -0.007634
    P2.L15      -0.016571
    P2.L16      -0.015587
    P2.L17      -0.035295
    P2.L18       0.000929
    P2.L19      -0.003956
    P2.L20      -0.020668
    P2.L21      -0.012730
    P2.L22       0.052431
    P2.L23       0.074405
    P2.L24      -0.024263
    P2.L25       0.090515
    P2.L26      -0.088646
    dtype: float64
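Those parameters are everything the model uses at prediction time: the one-step forecast is the intercept plus the dot product of the lag coefficients with the most recent observations. A sketch with a made-up AR(2) parameter vector:

```python
import numpy as np

# Hypothetical fitted parameters: intercept, then lag-1 and lag-2 coefficients
intercept = 2.0
coefs = np.array([0.5, 0.2])          # like P2.L1 and P2.L2 above

# Most recent observations, newest first (lag 1, then lag 2)
recent_lags = np.array([10.0, 8.0])

forecast = intercept + coefs @ recent_lags
print(forecast)  # 2.0 + 5.0 + 1.6 = 8.6
```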
    
    [48]:
     
    VimeoVideo("665851529", h="39284d37ac", width=600)
    [48]:

    Task 3.3.20: Put the values for y_test and y_pred_wfv into the DataFrame df_pred_test (don't forget the index). Then plot df_pred_test using plotly express.

    • Create a line plot in plotly express.
    [49]:
     
    df_pred_test = pd.DataFrame(
        {"y_test": y_test, "y_pred_wfv": y_pred_wfv}, index=y_test.index
    )
    fig = px.line(df_pred_test, labels={"value": "P2"})
    fig.show()
    [Plotly line chart: y_test and y_pred_wfv, Dec 26, 2018 to Jan 1, 2019]


    3.4. ARMA Models

    [1]:
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    /opt/conda/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:7: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
      from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,
    /opt/conda/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:7: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
      from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,
    
    [2]:
    VimeoVideo("665851728", h="95c59d2805", width=600)
    [2]:

    Prepare Data¶

    Import¶

    Task 3.4.1: Create a client to connect to the MongoDB server, then assign the "air-quality" database to db, and the "nairobi" collection to nairobi.

    • Create a client object for a MongoDB instance.
    • Access a database using PyMongo.
    • Access a collection in a database using PyMongo.
    [3]:
    client = MongoClient(host="localhost",port=27017)
    [4]:
        df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
    [5]:
    VimeoVideo("665851670", h="3efc0c20d4", width=600)
    [5]:

    Task 3.4.2: Change your wrangle function so that it has a resample_rule argument that allows the user to change the resampling interval. The argument default should be "1H".

    • What's an argument?
    • Include an argument in a function in Python.
    [6]:
    ), f"Your function should take two arguments: `'collection'`, `'resample_rule'`. Your function takes the following arguments: {func_params}"

    Task 3.4.3: Use your wrangle function to read the data from the nairobi collection into the Series y.

    [7]:
    y = wrangle(nairobi,resample_rule="1H")
    [7]:
    timestamp
    2018-09-01 03:00:00+03:00    17.541667
    2018-09-01 04:00:00+03:00    15.800000
    2018-09-01 05:00:00+03:00    11.420000
    2018-09-01 06:00:00+03:00    11.614167
    2018-09-01 07:00:00+03:00    17.665000
    Freq: H, Name: P2, dtype: float64
    [8]:
    ), f"There should be no null values in `y`. Your `y` has {y.isnull().sum()} null values."

    Explore¶

    [9]:
    VimeoVideo("665851654", h="687ff8d5ee", width=600)
    [9]:

    Task 3.4.4: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient".

    • What's an ACF plot?
    • Create an ACF plot using statsmodels
    [10]:
    plt.ylabel("Correlation coefficient")
    [10]:
    Text(0, 0.5, 'Correlation coefficient')
    [11]:
    VimeoVideo("665851644", h="e857f05bfb", width=600)
    [11]:

    Task 3.4.5: Create a PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient".

    • What's a PACF plot?
    • Create a PACF plot using statsmodels
    [12]:
    fig,ax = plt.subplots(figsize=(15,6))
    [12]:
    Text(0, 0.5, 'Correlation coefficient')

    Split¶

    Task 3.4.6: Create a training set y_train that contains only readings from October 2018, and a test set y_test that contains readings from November 1, 2018.

    • Subset a DataFrame by selecting one or more rows in pandas.
    [13]:
    y_test = y.get("November 1, 2018")
    [14]:
    print(len(y_test))
    24
    
    [15]:
    assert len(y_test) == 24, f"`y_test` should have 24 observations, not {len(y_test)}."

    Build Model¶

    Baseline¶

    Task 3.4.7: Calculate the baseline mean absolute error for your model.

    [16]:
    mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
    Mean P2 Reading: 10.12
    Baseline MAE: 4.17
    

    Iterate¶

    [17]:
    VimeoVideo("665851576", h="36e2dc6269", width=600)
    [17]:

    Task 3.4.8: Create ranges for possible p and q values. p_params should range between 0 and 25, by steps of 8. q_params should range between 0 and 3 by steps of 1.

    • What's a hyperparameter?
    • What's an iterator?
    • Create a range in Python.
    [18]:
    p_params = range(0,25,8)
    [18]:
    [0, 8, 16, 24]
    [19]:
    VimeoVideo("665851476", h="d60346ed30", width=600)
    [19]:

    Task 3.4.9: Complete the code below to train a model with every combination of hyperparameters in p_params and q_params. Every time the model is trained, the mean absolute error is calculated and then saved to a dictionary. If you're not sure where to start, do the code-along with Nicholas!

    • What's an ARMA model?
    • Append an item to a list in Python.
    • Calculate the mean absolute error for a list of predictions in scikit-learn.
    • Instantiate a predictor in statsmodels.
    • Train a model in statsmodels.
    • Write a for loop in Python.
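As a skeleton, the loop nests one for over q inside one over p and collects MAEs in a dictionary keyed by p, matching the printout below. This sketch uses a dummy scoring function where the real code would fit an ARIMA(p, 0, q) model and compute its MAE:

```python
p_params = range(0, 25, 8)   # 0, 8, 16, 24
q_params = range(0, 3, 1)    # 0, 1, 2

def dummy_mae(p, q):
    """Stand-in for train-then-score; the real code fits ARIMA(p, 0, q)."""
    return 1.0 / (1 + p + q)

mae_grid = {}
for p in p_params:
    # Create key-value pair in dict. Key is `p`, value is an empty list.
    mae_grid[p] = []
    for q in q_params:
        mae_grid[p].append(dummy_mae(p, q))

print(mae_grid)
```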
    [20]:
        # Create key-value pair in dict. Key is `p`, value is empty list.
    Trained ARIMA (0, 0, 0) in 0.39 seconds.
    Trained ARIMA (0, 0, 1) in 0.4 seconds.
    Trained ARIMA (0, 0, 2) in 1.09 seconds.
    Trained ARIMA (8, 0, 0) in 9.9 seconds.
    Trained ARIMA (8, 0, 1) in 38.9 seconds.
    Trained ARIMA (8, 0, 2) in 75.5 seconds.
    Trained ARIMA (16, 0, 0) in 36.61 seconds.
    Trained ARIMA (16, 0, 1) in 126.9 seconds.
    Trained ARIMA (16, 0, 2) in 187.7 seconds.
    Trained ARIMA (24, 0, 0) in 108.79 seconds.
    Trained ARIMA (24, 0, 1) in 156.42 seconds.
    Trained ARIMA (24, 0, 2) in 306.1 seconds.
    
    {0: [4.171460443827197, 3.3506427433555537, 3.1057222587888647], 8: [2.9384480570404223, 2.9149008203151663, 2.908046094677133], 16: [2.9201084726122, 2.929436207635203, 2.913638894139686], 24: [2.914390327277196, 2.9136013252035013, 2.89792511429827]}
    
    [21]:
    VimeoVideo("665851464", h="12f4080d0b", width=600)
    [21]:

    Task 3.4.10: Organize all the MAEs from above in a DataFrame named mae_df. Each row represents a possible value for q and each column represents a possible value for p.

    • Create a DataFrame from a dictionary using pandas.
    [22]:
    mae_df = pd.DataFrame(mae_grid)
    [22]:
            0       8      16      24
    0  4.1715  2.9384  2.9201  2.9144
    1  3.3506  2.9149  2.9294  2.9136
    2  3.1057  2.9080  2.9136  2.8979
    [23]:
    VimeoVideo("665851453", h="dfd415bc08", width=600)
    [23]:

    Task 3.4.11: Create a heatmap of the values in mae_grid. Be sure to label your x-axis "p values" and your y-axis "q values".

    • Create a heatmap in seaborn.
    [24]:
    sns.heatmap(mae_df,cmap="Blues")
    [24]:
    Text(33.0, 0.5, 'q values')
    [25]:
    VimeoVideo("665851444", h="8b58161f26", width=600)
    [25]:

    Task 3.4.12: Use the plot_diagnostics method to check the residuals for your model. Keep in mind that the plot will represent the residuals from the last model you trained, so make sure it was your best model, too!

    • Examine time series model residuals using statsmodels.
    [26]:
    fig, ax = plt.subplots(figsize=(15, 12))
    [26]:

    Evaluate¶

    [27]:
    VimeoVideo("665851439", h="c48d80cdf4", width=600)
    [27]:

    Task 3.4.13: Complete the code below to perform walk-forward validation for your model for the entire test set y_test. Store your model's predictions in the Series y_pred_wfv. Choose the values for p and q that best balance model performance and computation time. Remember: This model is going to have to train 24 times before you can see your test MAE!

    [ ]:
        history = history.append(y_test[next_pred.index])
    [ ]:
    print("Test MAE (walk forward validation):", round(test_mae, 2))

    Communicate Results¶

    [ ]:
    VimeoVideo("665851423", h="8236ff348f", width=600)

    Task 3.4.14: First, generate the list of training predictions for your model. Next, create a DataFrame df_predictions with the true values y_test and your predictions y_pred_wfv (don't forget the index). Finally, plot df_predictions using plotly express. Make sure that the y-axis is labeled "P2".

    • Generate in-sample predictions for a model in statsmodels.
    • Create a DataFrame from a dictionary using pandas.
    • Create a line plot in pandas.
    [ ]:
    fig = px.line(df_predictions, labels={"value": "P2"})


    Pandas: Advanced

    Calculate Summary Statistics for a DataFrame or Series¶

    Many datasets are large, and it can be helpful to get a summary of information in them. Let's load a dataset and examine the first few rows:

    [ ]:
    mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

    Let's get a summary description of this dataset.

    [ ]:
    mexico_city1.describe()

    Like most large datasets, this one has many values which are missing. The describe function will ignore missing values in each column. You can also remove rows and columns with missing values, and then get a summary of the data that's still there. We need to remove columns first, before removing the rows; the sequence of operations here is important. The code looks like this:

    [ ]:
        ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
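On a toy DataFrame you can see why the order matters: dropping a mostly-empty column first preserves rows that dropna would otherwise discard for a single missing value (the column names here are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "price": [100.0, 200.0, 300.0],
        "rooms": [np.nan, np.nan, 2.0],   # mostly missing
    }
)

rows_first = df.dropna().shape[0]                          # only 1 row survives
cols_first = df.drop(columns=["rooms"]).dropna().shape[0]  # all 3 rows survive
print(rows_first, cols_first)
```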

    Let's take a look at our new, cleaner dataset.

    [ ]:
    mexico_city1.describe()

    Practice

    Reload the mexico-city-real-estate-1.csv dataset. Reverse the sequence of operations by first dropping all rows where there is a missing value, and then dropping the floor, price_usd_per_m2, expenses, and rooms columns. What is the size of the resulting DataFrame?

    [ ]:
    mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")

    Select a Series from a DataFrame¶

    Since the datasets we work with are so large, you might want to focus on a single column of a DataFrame. Let's load up the mexico-city-real-estate-2 dataset, and examine the first few rows to find the column names.

    [ ]:
    mexico_city2 = pd.read_csv("./data/mexico-city-real-estate-2.csv")

    Maybe we're interested in the surface_covered_in_m2 column. The code to extract just that one column looks like this:

    [ ]:
    surface_covered_in_m2 = mexico_city2["surface_covered_in_m2"]

    Practice

    Select the price Series from the mexico_city2 DataFrame and assign it to the variable price.

    [ ]:
    print(price)

    Subset a DataFrame by Selecting One or More Columns¶

    You may find it more efficient to work with a smaller portion of a dataset that's relevant to you. One way to do this is to select some columns from a DataFrame and make a new DataFrame with them. Let's load a dataset to do this and examine the first few rows to find the column headings:

    [ ]:
    mexico_city4 = pd.read_csv("./data/mexico-city-real-estate-4.csv")

    Let's choose "operation", "property_type", "place_with_parent_names", and "price":

    [ ]:
        ["operation", "property_type", "place_with_parent_names", "price"]

    Once we've done that, we can find the resulting number of entries in the DataFrame:

    [ ]:
    mexico_city4_subset.shape

    Practice

    Load the mexico-city-real-estate-1.csv dataset and subset it to obtain the operation, lat-lon and place_with_parent_names columns only. How many entries are in the resulting DataFrame?

    [ ]:
        ["operation", "lat-lon", "place_with_parent_names"]

    Subset the Columns of a DataFrame Based on Data Types¶

    It's helpful to be able to find specific types of entries — typically numeric ones — and put all of these in a separate DataFrame. First, let's take a look at the mexico-city-real-estate-5 dataset.

    [ ]:
    mexico_city5 = pd.read_csv("./data/mexico-city-real-estate-5.csv")

    The code to subset just the numerical entries looks like this:

    [ ]:
    mexico_city5_numbers = mexico_city5.select_dtypes(include="number")

    Practice

    Create a subset of the DataFrame from mexico-city-real-estate-5 which excludes numbers.

    [ ]:
    print(mexico_city3_no_numbers.shape)

    Working with value_counts in a Series¶

    In order to use the data in a series for other types of analysis, it might be helpful to know how often each value occurs in the Series. To do that, we use the value_counts method to aggregate the data. Let's take a look at the number of properties associated with each department in the colombia-real-estate-1 dataset.

    [ ]:
    df1 = pd.read_csv("data/colombia-real-estate-1.csv", usecols=["department"])
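value_counts returns a Series of frequencies, sorted from most to least common. A tiny stand-alone example with invented departments:

```python
import pandas as pd

departments = pd.Series(
    ["Antioquia", "Santander", "Antioquia", "Antioquia", "Santander"]
)

counts = departments.value_counts()
print(counts)
```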

    Practice

    Try it yourself! Aggregate the different property types in the colombia-real-estate-2 dataset.

    [ ]:
    df2 = ...

    Series and Groupby¶

    Large Series often include data points that have some attribute in common, but which are nevertheless not grouped together in the dataset. Happily, pandas has a method that will bring these data points together into groups.

    Let's take a look at the colombia-real-estate-1 dataset. The set includes properties scattered across Colombia, so it might be useful to group properties from the same department together; to do this, we'll use the groupby method. The code looks like this:

    [ ]:
    dept_group = df1.groupby("department")

    To make sure we got all the departments in the dataset, let's print the first occurrence of each department.

    [ ]:
    dept_group.first()

    Practice

    Try it yourself! Group the properties in colombia-real-estate-2 by department, and print the result.

    [ ]:
    dept_group.first()

    Now that we have all the properties grouped by department, we might want to see the properties in just one of the departments. We can use the get_group method to do that. If we just wanted to see the properties in "Santander", for example, the code would look like this:

    [ ]:
    dept_group1 = df1.groupby("department")

    We can also make groups based on more than one category by adding them to the groupby method. After resetting the df1 DataFrame, here's what the code looks like if we want to group properties both by department and by property_type.

    [ ]:
    dept_group2 = df1.groupby(["department", "property_type"])

    Practice

    Try it yourself! Group the properties in colombia-real-estate-2 by department and property type, and print the result.

    [ ]:
    dept_group.first()

    Finally, it's possible to use groupby to calculate aggregations. For example, if we wanted to find the average property area in each department, we would use the .mean() method. This is what the code for that looks like:

    [ ]:
    dept_group = df1.groupby("department")["area_m2"].mean()
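On a small frame, the aggregation is easy to check by hand (numbers invented):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "department": ["Santander", "Santander", "Antioquia"],
        "area_m2": [100.0, 200.0, 90.0],
    }
)

# Mean area per department: Santander -> 150.0, Antioquia -> 90.0
mean_area = df.groupby("department")["area_m2"].mean()
print(mean_area)
```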

    Practice

    Try it yourself! Use the .mean method on the colombia-real-estate-2 dataset to find the average price in Colombian pesos ("price_cop") for properties in each "department".

    [ ]:
    dept_group = ...

    Subset a DataFrame's Columns Based on the Column Data Types¶

    It's helpful to be able to find entries of a certain type, typically numerical entries, and put all of these in a separate DataFrame. Let's load a dataset to see how this works:

    [ ]:
    mexico_city5 = pd.read_csv("./data/mexico-city-real-estate-5.csv")

    Now let's get only numerical entries:

    [ ]:
    mexico_city5_numbers = mexico_city5.select_dtypes(include="number")

    Let's now find all entries which are not numerical entries:

    [ ]:
    mexico_city5_no_numbers = mexico_city5.select_dtypes(exclude="number")

    Practice

Create a subset of the DataFrame from mexico-city-real-estate-3.csv which excludes numbers. How many entries does it have?

[ ]:
mexico_city3 = pd.read_csv("./data/mexico-city-real-estate-3.csv")
mexico_city3_no_numbers = mexico_city3.select_dtypes(exclude="number")
print(mexico_city3_no_numbers.shape)

    Subset a DataFrame's columns based on column names¶

    To subset a DataFrame by column names, either define a list of columns to include or define a list of columns to exclude. Once you've done that, you can retain or drop the columns accordingly. For example, let's suppose we want to modify the mexico_city3 dataset and only retain the first three columns. Let's define two lists, one with the columns to retain and one with the columns to drop:

    [ ]:
keep_cols = ["operation", "property_type", "place_with_parent_names"]
drop_cols = [col for col in mexico_city3.columns if col not in keep_cols]

    Next, we'll explore both approaches to subset mexico_city3. To retain columns based on keep_cols:

    [ ]:
    mexico_city3_subsetted = mexico_city3[keep_cols]

    To drop columns in drop_cols:

    [ ]:
    mexico_city3_subsetted = mexico_city3.drop(columns=drop_cols)

    Practice

    Create a subset of the DataFrame from mexico-city-real-estate-3.csv which excludes the last two columns.

[ ]:

    Pivot Tables¶

    A pivot table allows us to aggregate and summarize a DataFrame across multiple variables. For example, let's suppose we wanted to calculate the mean of the price column in the mexico_city3 dataset for the different values in the property_type column:

    [ ]:
    mexico_city3_pivot = ...
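Since the cell above is left for you to fill in, here's a self-contained sketch of the idea on toy data; the rows and the "price" column name are invented for illustration:

```python
import pandas as pd

# Toy stand-in for mexico_city3; rows and the "price" column are invented.
mexico_city3 = pd.DataFrame(
    {
        "property_type": ["house", "apartment", "house", "apartment"],
        "price": [2_000_000, 1_000_000, 3_000_000, 1_500_000],
    }
)

# Mean price for each unique value in "property_type".
mexico_city3_pivot = pd.pivot_table(
    mexico_city3, values="price", index="property_type", aggfunc="mean"
)
```

The result is a small DataFrame with one row per property type and the mean price in each.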

    Subsetting with Masks¶

    Another way to create subsets from a larger dataset is through masking. Masks are ways to filter out the data you're not interested in so that you can focus on the data you are. For example, we might want to look at properties in Colombia that are bigger than 200 square meters. In order to create this subset, we'll need to use a mask.

First, we'll reset our df1 DataFrame so that we can draw on all the data in its original form. Then we'll write a comparison statement and assign the result to mask.

    [ ]:
df1 = pd.read_csv("data/colombia-real-estate-1.csv")
mask = df1["area_m2"] > 200

    Notice that mask is a Series of Boolean values. Where properties are smaller than 200 square meters, our statement evaluates as False; where they're bigger than 200, it evaluates to True.

    Once we have our mask, we can use it to select all the rows from df1 that evaluate as True.

    [ ]:
    df1[mask].head()

    Practice

    Try it yourself! Read colombia-real-estate-2 into a DataFrame named df2, and create a mask to select all the properties that are smaller than 100 square meters.

    [ ]:
df2 = pd.read_csv("data/colombia-real-estate-2.csv")
mask = df2["area_m2"] < 100
df2[mask].head()

    We can also create masks with multiple criteria using & for "and" and | for "or." For example, here's how we would find all properties in Atlántico that are bigger than 400 square meters.

    [ ]:
    mask = (df1["area_m2"] > 400) & (df1["department"] == "Atlántico")

    Practice

    Try it yourself! Create a mask for df2 to select all the properties in Tolima that are smaller than 150 square meters.

    [ ]:
mask = (df2["area_m2"] < 150) & (df2["department"] == "Tolima")
df2[mask].head()

    Reshape a DataFrame based on column values¶

    What's a pivot table?¶

    A pivot table allows you to quickly aggregate and summarize a DataFrame using an aggregation function. For example, to build a pivot table that summarizes the mean of the price_cop column for each of the unique categories in the property_type column in df2:

    [ ]:
    pivot1 = pd.pivot_table(df2, values="price_cop", index="property_type", aggfunc=np.mean)

    Practice

    Build a pivot table that summarizes the mean of the price_cop column for each of the unique departments in the department column in df2:

    [ ]:
    pivot2 = pd.pivot_table(df2, values="price_cop", index="department", aggfunc=np.mean)

    Combine multiple categories in a Series¶

A categorical variable can be collapsed into a smaller number of categories. One approach is to keep the most frequently observed values and collapse all the rest into a single category. For example, to retain only the top 10 most frequent categories in the department column and collapse the others together, start by using value_counts to count the values:

    [ ]:
    df2["department"].value_counts()

    Next, select just the top 10 using the head() method, and select the column names by using the index attribute of the series:

    [ ]:
    top_10 = df2["department"].value_counts().head(10).index

    Finally, use the apply method and a lambda function to select only the values from the department column and collapse the remaining values into the value Other:

    [ ]:
    df2["department"] = df2["department"].apply(lambda x: x if x in top_10 else "Other")

    Practice

Try it yourself! Retain the top 5 most frequent values in the department column and collapse the remaining values into the category Other.

[ ]:

    Cross Tabulation¶

The pandas crosstab function is useful for computing grouped summary statistics for categorical data. It takes two categorical columns, using one as the index and the other as the columns. If no value column and aggregate function are given, crosstab simply counts the frequency of each combination by default. Let's see an example using the Colombia real estate dataset.

    [ ]:
    df = pd.read_csv("data/colombia-real-estate-1.csv")

The following call calculates the frequency of each combination of two variables, department and property_type, in the dataset.

    [ ]:
    pd.crosstab(index=df["department"], columns=df["property_type"])

From the previous example, you can see we've created a DataFrame whose index holds the unique values of department and whose columns hold the unique values of property_type. Each cell shows how many data points there are for each combination of department and property_type. For example, there are 8 apartments in the department of Antioquia.

    We can further specify a value column and an aggregate function, like in pivot_table, to conduct more complicated calculations for the two categorical variables. In the following example, we're looking at the average area size for different property types in the departments in Colombia.

    [ ]:
pd.crosstab(
    index=df["department"],
    columns=df["property_type"],
    values=df["area_m2"],
    aggfunc="mean",
)
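Here's a runnable, self-contained version of the same two uses of crosstab; the listings below are invented, but the column names mirror the Colombia dataset:

```python
import pandas as pd

# Invented listings; column names mirror the Colombia dataset.
df = pd.DataFrame(
    {
        "department": ["Antioquia", "Antioquia", "Bolívar", "Bolívar"],
        "property_type": ["apartment", "house", "apartment", "apartment"],
        "area_m2": [80.0, 120.0, 60.0, 100.0],
    }
)

# Default behavior: count each department/property_type combination.
counts = pd.crosstab(index=df["department"], columns=df["property_type"])

# With a value column and an aggregate function: mean area per combination.
mean_area = pd.crosstab(
    index=df["department"],
    columns=df["property_type"],
    values=df["area_m2"],
    aggfunc="mean",
)
```

Combinations with no data show up as 0 in the count table and NaN in the aggregated table.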

    Practice

Create a cross tabulation calculating the frequency of combinations from mexico-city-real-estate-3.csv, using currency as the index and property_type as the columns.

    [ ]:
    # Create `crosstab`

    Applying Functions to DataFrames and Series¶

apply is a useful method for applying a function to all the rows or columns of a DataFrame efficiently. Let's take the following real estate dataset as an example:

    [ ]:
    df = pd.read_csv("data/colombia-real-estate-2.csv", usecols=["area_m2", "price_cop"])

By specifying the function inside apply(), we can transform the whole DataFrame. For example, here we calculate the square root of every value:

    [ ]:
import numpy as np

df.apply(np.sqrt)

    Note you can also specify the "axis" inside apply. By default, we have axis=0, which means we are applying the function to each column. We can also switch to axis=1 if we want to apply the function to each row. See the following example showing the sum of all rows for each column:

    [ ]:
    df.apply(np.sum, axis=0)

    The following code will get the sum of all columns for each row:

    [ ]:
    df.apply(np.sum, axis=1)

By specifying the column name, we can also apply the function to a specific column or columns. Note that we can also specify the index (row names) to apply functions only to specific rows; however, this is not common in practice.

    [ ]:
    df["area_m2"].apply(np.sqrt)

    We can assign the results to a new column:

    [ ]:
    df["area_sqrt"] = df["area_m2"].apply(np.sqrt)

    Practice

    Try it yourself! Create a new column named 'sum_columns', which is the sum of all numerical columns in df:

    [ ]:
    df["sum_columns"] = ...

    df.aggregate(), or df.agg(), shares the same concept as df.apply() in terms of applying functions to a DataFrame, but df.aggregate() can only apply aggregate functions like sum, mean, min, max, etc. See the following example for more details:

    [ ]:
    df = pd.read_csv("data/colombia-real-estate-2.csv", usecols=["area_m2", "price_cop"])

We can find the minimum value in each column by calling the min aggregate function:

    [ ]:
    df.agg("min")

    Like apply(), we can also specify the axis argument to switch axis:

    [ ]:
    df.agg("min", axis=1)

We can apply an aggregate function to a whole DataFrame using df.agg(), or specify a column name to aggregate just a subset:

    [ ]:
    df["area_m2"].agg("min")

    We can also apply a list of aggregate functions to a DataFrame:

    [ ]:
    df.agg(["sum", "max"])

The syntax above calculates both sum and max for each column, storing the function names in the index. We can also apply different aggregate functions to different columns. In this case, we pass a dictionary whose keys are column names and whose values are the corresponding aggregate function names:

    [ ]:
    df.agg({"area_m2": "sum", "price_cop": "min"})

    Practice

    Try it yourself! Find the minimum for column area_m2 and the maximum for column price_cop using df.agg():

[ ]:

    Working with Dates¶

    Time Stamps¶

    Pandas' time series capabilities are built on the Timestamp class. For instance, we can transform date strings of different formats into Timestamp:

    [ ]:
    print(pd.Timestamp("January 8, 2022"))

    We can also do date math with Timestamps. Note that when we subtract two dates, we get a special object called a Timedelta.

    [ ]:
    time_delta = pd.Timestamp("Feb. 11 2016 2:30 am") - pd.Timestamp("2015-08-03 5:14 pm")
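A Timedelta also exposes its components; for example, using the same subtraction as above:

```python
import pandas as pd

# Same subtraction as above; a Timedelta exposes its components.
time_delta = pd.Timestamp("Feb. 11 2016 2:30 am") - pd.Timestamp("2015-08-03 5:14 pm")

whole_days = time_delta.days  # integer number of whole days
total_hours = time_delta.total_seconds() / 3600  # full span expressed in hours
```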

The pandas Timestamp class is also compatible with Python's datetime module.

    [ ]:
import datetime

pd.Timestamp("Feb 11, 2016") - datetime.datetime(2015, 8, 3)

We can also manipulate dates using the offset classes in pandas.tseries.offsets. For instance, the Day() offset adds or subtracts days from a timestamp. The following calculates the date four days prior to 9 January 2017:

    [ ]:
from pandas.tseries.offsets import BDay, BMonthEnd, Day

print(pd.Timestamp("January 9, 2017") - Day(4))

In some cases, you might need to add or subtract only business days. That's when you use BDay():

    [ ]:
    print(pd.Timestamp("January 9, 2017") - BDay(4))

We can also do an offset at the monthly level. For example, you can use BMonthEnd(n) to find the nth business month-end from a date. The following shows the last business day of January 2017:

    [ ]:
    print(pd.Timestamp("January 9, 2017") + BMonthEnd(0))

The following shows the last business day of May 2017, the fourth business month-end after 9 February 2017:

    [ ]:
    print(pd.Timestamp("February 9, 2017") + BMonthEnd(4))

We can also use the strptime function from the datetime module to parse a string into a datetime object:

    [ ]:
from datetime import datetime

date = datetime.strptime("16 Oct 2022 12:14:05", "%d %b %Y %H:%M:%S")

    We can further transform it into the ISO 8601 format:

    [ ]:
    date.isoformat()

    Date Time Indices¶

DatetimeIndex converts date-like data into a proper datetime index for a DataFrame, which makes plotting and modeling time series much easier. Let's look at an example. Here we have Apple's stock prices from 1999 to 2022.

    [ ]:
    data.set_index("date", inplace=True)

    Even though the index looks like dates, it is not in date format. So when we plot the data, the index doesn't follow the sequence of years, and the x-axis ticks are hard to read.

    [ ]:
    data["open"].plot()

    You can see the index is not in date format.

    [ ]:
    data.index[:5]

    We can use the DatetimeIndex function to transform it into date:

    [ ]:
    pd.DatetimeIndex(data.index)

    And we can set it as the index:

    [ ]:
    data.index = pd.DatetimeIndex(data.index)
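To see the conversion without the stock file, here's a self-contained sketch with an invented three-row frame whose index starts out as plain strings:

```python
import pandas as pd

# Invented three-row frame; the index starts out as plain strings.
data = pd.DataFrame(
    {"open": [1.0, 2.0, 3.0]},
    index=["1999-12-31", "2000-01-03", "2000-01-04"],
)

# Convert the string index into a proper DatetimeIndex.
data.index = pd.DatetimeIndex(data.index)
```

After the conversion, the index entries are real Timestamps, so date-aware slicing and plotting work as expected.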

    Now we can see the benefit from plotting:

    [ ]:
    data["open"].plot()

    Date Ranges¶

    If we're entering time series data into a DataFrame, it will often be useful to create a range of dates using date_range. We can create it with different frequencies by specifying freq. Here are the days in a specific range:

    [ ]:
    pd.date_range(start="1/8/2022", end="3/2/2022", freq="D")

    Here are the business dates for a specific range:

    [ ]:
    pd.date_range(start="1/8/2022", end="3/2/2022", freq="B")

    References & Further Reading¶

    • Pandas Documentation on Dropping a Column in a DataFrame
    • Pandas Documentation on Printing out the First Few Rows of a DataFrame
    • Pandas Documentation on Reading in a CSV File Into a DataFrame
    • Getting Started with Pandas Documentation
    • Pandas Documentation on Extracting a Column to a Series
    • Pandas Official Indexing Guide
    • Series in pandas
    • Tutorial for value_counts
    • Tutorial for groupby
    • pandas Documentation for groupby
    • Online Examples of Selecting Numeric Columns of a DataFrame
    • Official Pandas Documentation on Data Types in a DataFrame
    • Pandas documentation for Boolean indexing
    • More information on creating various kinds of subsets
    • More on boolean indexing
    • A tutorial on using various criteria to create subsets
    • Pandas.DataFrame.apply
    • Pandas.DataFrame.aggregate

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

    3.5. Air Quality in Dar es Salaam 🇹🇿

[1]:
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
[21]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
[3]:
from pprint import PrettyPrinter

pp = PrettyPrinter(indent=2)

    Prepare Data¶

    Connect¶

    Task 3.5.1: Connect to MongoDB server running at host "localhost" on port 27017. Then connect to the "air-quality" database and assign the collection for Dar es Salaam to the variable name dar.

    [4]:
from pymongo import MongoClient

client = MongoClient(host="localhost", port=27017)
db = client["air-quality"]
dar = db["dar-es-salaam"]
for collection in db.list_collections():
    print(collection["name"])
    system.views
    lagos
    system.buckets.lagos
    nairobi
    system.buckets.nairobi
    dar-es-salaam
    system.buckets.dar-es-salaam
    
    [23]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.1", [dar.name])

    You got it. Dance party time! 🕺💃🕺💃

    Score: 1

    Explore¶

Task 3.5.2: Determine the numbers assigned to all the sensor sites in the Dar es Salaam collection. Your submission should be a list of integers.

    [6]:
    sites = dar.distinct("metadata.site")
    [6]:
    [23, 11]
    [25]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.2", sites)

    Party time! 🎉🎉🎉

    Score: 1

Task 3.5.3: Determine which site in the Dar es Salaam collection has the most sensor readings (of any type, not just PM2.5 readings). Your submission readings_per_site should be a list of dictionaries that follows this format:

    [{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]
    

    Note that the values here ☝️ are from the Nairobi collection, so your values will look different.

    [7]:
result = dar.aggregate(
    [
        {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}},
    ]
)
readings_per_site = list(result)
readings_per_site
    [7]:
    [{'_id': 23, 'count': 60020}, {'_id': 11, 'count': 138412}]
    [28]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.3", readings_per_site)

    Wow, you're making great progress.

    Score: 1

    Import¶

    Task 3.5.4: (5 points) Create a wrangle function that will extract the PM2.5 readings from the site that has the most total readings in the Dar es Salaam collection. Your function should do the following steps:

    1. Localize reading time stamps to the timezone for "Africa/Dar_es_Salaam".
    2. Remove all outlier PM2.5 readings that are above 100.
    3. Resample the data to provide the mean PM2.5 reading for each hour.
4. Impute any missing values using the forward-fill method.
    5. Return a Series y.
    [8]:
    result = dar.find_one({})
    { 'P1': 24.3,
      '_id': ObjectId('6334b0ef8c51459f9b1bffb7'),
      'metadata': { 'lat': -6.818,
                    'lon': 39.285,
                    'measurement': 'P1',
                    'sensor_id': 29,
                    'sensor_type': 'SDS011',
                    'site': 11},
      'timestamp': datetime.datetime(2018, 1, 1, 0, 0, 4, 53000)}
    
    [9]:
        df.index = df.index.tz_localize("UTC").tz_convert("Africa/Dar_es_Salaam")
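Only one line of the wrangle function survives in the cell above. The same cleaning steps can be sketched offline, run on a handful of in-memory records in place of a MongoDB query; the record layout and readings below are invented for illustration:

```python
import pandas as pd

def wrangle_records(records):
    """Apply the task's cleaning steps to raw PM2.5 records.

    `records` stands in for documents returned by a MongoDB query; each
    dict carries a naive-UTC "timestamp" and a "P2" reading (invented here).
    """
    df = pd.DataFrame(records).set_index("timestamp")
    # 1. Localize timestamps to Dar es Salaam's timezone.
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Dar_es_Salaam")
    # 2. Remove outlier readings above 100.
    df = df[df["P2"] < 100]
    # 3. Resample to the mean reading for each hour, and
    # 4. forward-fill any gaps left by the resampling.
    y = df["P2"].resample("1h").mean().ffill()
    # 5. Return a Series.
    return y

records = [
    {"timestamp": pd.Timestamp("2018-01-01 00:05"), "P2": 9.0},
    {"timestamp": pd.Timestamp("2018-01-01 00:35"), "P2": 11.0},
    {"timestamp": pd.Timestamp("2018-01-01 01:10"), "P2": 250.0},  # outlier
    {"timestamp": pd.Timestamp("2018-01-01 02:20"), "P2": 8.0},
]
y = wrangle_records(records)
```

The outlier hour is dropped entirely, so its resampled bin comes back as NaN and is forward-filled from the previous hour.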

    Use your wrangle function to query the dar collection and return your cleaned results.

    [10]:
    y = wrangle(dar)
    [10]:
    timestamp
    2018-01-01 03:00:00+03:00    9.456327
    2018-01-01 04:00:00+03:00    9.400833
    2018-01-01 05:00:00+03:00    9.331458
    2018-01-01 06:00:00+03:00    9.528776
    2018-01-01 07:00:00+03:00    8.861250
    Freq: H, Name: P2, dtype: float64
    [38]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.4", wrangle(dar))

    Yup. You got it.

    Score: 1

    Explore Some More¶

    Task 3.5.5: Create a time series plot of the readings in y. Label your x-axis "Date" and your y-axis "PM2.5 Level". Use the title "Dar es Salaam PM2.5 Levels".

    [11]:
fig, ax = plt.subplots()
y.plot(xlabel="Date", ylabel="PM2.5 Level", title="Dar es Salaam PM2.5 Levels", ax=ax)
    [40]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.5", file)

    Very impressive.

    Score: 1

    Task 3.5.6: Plot the rolling average of the readings in y. Use a window size of 168 (the number of hours in a week). Label your x-axis "Date" and your y-axis "PM2.5 Level". Use the title "Dar es Salaam PM2.5 Levels, 7-Day Rolling Average".

    [12]:
fig, ax = plt.subplots()
y.rolling(168).mean().plot(xlabel="Date", ylabel="PM2.5 Level", ax=ax);
    [42]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.6", file)

    Awesome work.

    Score: 1

    Task 3.5.7: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient". Use the title "Dar es Salaam PM2.5 Readings, ACF".

    [13]:
plt.title("Dar es Salaam PM2.5 Readings, ACF")
    [46]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.7", file)

    Boom! You got it.

    Score: 1

    Task 3.5.8: Create an PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient". Use the title "Dar es Salaam PM2.5 Readings, PACF".

    [14]:
plt.title("Dar es Salaam PM2.5 Readings, PACF")
    [48]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.8", file)

    Boom! You got it.

    Score: 1

    Split¶

    Task 3.5.9: Split y into training and test sets. The first 90% of the data should be in your training set. The remaining 10% should be in the test set.
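Schematically, a 90/10 time-ordered split looks like this, with a synthetic hourly series standing in for y:

```python
import pandas as pd

# Synthetic hourly series standing in for y.
y = pd.Series(
    range(100),
    index=pd.date_range("2018-01-01", periods=100, freq="1h"),
)

# First 90% for training, last 10% for testing; no shuffling for time series.
cutoff = int(len(y) * 0.90)
y_train = y.iloc[:cutoff]
y_test = y.iloc[cutoff:]
```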

    [15]:
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
    y_train shape: (1533,)
    y_test shape: (171,)
    
    [50]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.9a", y_train)

    Yes! Keep on rockin'. 🎸That's right.

    Score: 1

    [52]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.9b", y_test)

    🥳

    Score: 1

    Build Model¶

    Baseline¶

    Task 3.5.10: Establish the baseline mean absolute error for your model.
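The baseline simply predicts the training mean for every observation; here's a self-contained sketch with invented readings, using a hand-rolled MAE in place of sklearn's mean_absolute_error (which computes the same thing):

```python
import pandas as pd

# Invented training readings standing in for y_train.
y_train = pd.Series([4.0, 6.0, 8.0, 10.0, 12.0])

# The baseline predicts the training mean for every observation.
y_pred_baseline = [y_train.mean()] * len(y_train)

# Mean absolute error: average absolute gap between truth and prediction.
mae_baseline = (y_train - y_pred_baseline).abs().mean()
```

Any trained model should beat this number to be worth keeping.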

    [16]:
    mae_baseline = mean_absolute_error(y_train,y_pred_baseline)
    Mean P2 Reading: 8.617582545265428
    Baseline MAE: 4.07658759405218
    
    [54]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.10", [mae_baseline])

    Correct.

    Score: 1

    Iterate¶

    Task 3.5.11: You're going to use an AR model to predict PM2.5 readings, but which hyperparameter settings will give you the best performance? Use a for loop to train your AR model on using settings for p from 1 to 30. Each time you train a new model, calculate its mean absolute error and append the result to the list maes. Then store your results in the Series mae_series.
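The shape of the tuning loop can be sketched offline; below, a plain least-squares AR fit stands in for statsmodels' AutoReg, and the series is synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic series standing in for y_train.
rng = np.random.default_rng(42)
y_train = pd.Series(10 + 0.1 * np.cumsum(rng.normal(size=200)))

maes = []
p_params = range(1, 6)
for p in p_params:
    # Least-squares AR(p) fit: regress y_t on its previous p values plus
    # an intercept (a stand-in for AutoReg(y_train, lags=p).fit()).
    lags = [y_train.shift(i).iloc[p:] for i in range(1, p + 1)]
    X = np.column_stack(lags + [np.ones(len(y_train) - p)])
    coef, *_ = np.linalg.lstsq(X, y_train.iloc[p:], rcond=None)
    # In-sample predictions and their mean absolute error.
    y_pred = X @ coef
    maes.append(np.mean(np.abs(y_train.iloc[p:] - y_pred)))

mae_series = pd.Series(maes, name="mae", index=p_params)
```

Note the first p observations are skipped at each setting, exactly as in the assignment's `y_train.iloc[p:]`.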

    [26]:
            mae = mean_absolute_error(y_train.iloc[p:],y_pred)
    [26]:
    1    0.947888
    2    0.933894
    3    0.920850
    4    0.920153
    5    0.919519
    Name: mae, dtype: float64
    [27]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.11", mae_series)

    Awesome work.

    Score: 1

    Task 3.5.12: Look through the results in mae_series and determine what value for p provides the best performance. Then build and train final_model using the best hyperparameter value.

    Note: Make sure that you build and train your model in one line of code, and that the data type of best_model is statsmodels.tsa.ar_model.AutoRegResultsWrapper.

    [28]:
    best_model = AutoReg(y_train,lags=best_p).fit()
    [29]:
        "Project 3 Assessment", "Task 3.5.12", [isinstance(best_model.model, AutoReg)]

    Python master 😁

    Score: 1

    Task 3.5.13: Calculate the training residuals for best_model and assign the result to y_train_resid. Note that your name of your Series should be "residuals".

    [30]:
    y_train_resid.name = "residuals"
    [30]:
    timestamp
    2018-01-02 09:00:00+03:00   -0.560158
    2018-01-02 10:00:00+03:00   -2.193105
    2018-01-02 11:00:00+03:00   -0.026408
    2018-01-02 12:00:00+03:00    0.820217
    2018-01-02 13:00:00+03:00   -0.078009
    Freq: H, Name: residuals, dtype: float64
    [31]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.13", y_train_resid.tail(1500))

    Party time! 🎉🎉🎉

    Score: 1

    Task 3.5.14: Create a histogram of y_train_resid. Be sure to label the x-axis as "Residuals" and the y-axis as "Frequency". Use the title "Best Model, Training Residuals".

    [33]:
    plt.title("Best Model, Training Residuals")
    [34]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.14", file)

    Boom! You got it.

    Score: 1

    Task 3.5.15: Create an ACF plot for y_train_resid. Be sure to label the x-axis as "Lag [hours]" and y-axis as "Correlation Coefficient". Use the title "Dar es Salaam, Training Residuals ACF".

    [35]:
    plt.title("Dar es Salaam, Training Residuals ACF")
    [36]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.15", file)

    Very impressive.

    Score: 1

    Evaluate¶

    Task 3.5.16: Perform walk-forward validation for your model for the entire test set y_test. Store your model's predictions in the Series y_pred_wfv. Make sure the name of your Series is "prediction" and the name of your Series index is "timestamp".

    [39]:
        history = history.append(y_test[next_pred.index])
    [39]:
    timestamp
    2018-03-06 00:00:00+03:00    8.049707
    2018-03-06 01:00:00+03:00    8.676565
    2018-03-06 02:00:00+03:00    6.271593
    2018-03-06 03:00:00+03:00    6.319877
    2018-03-06 04:00:00+03:00    7.163557
    Freq: H, Name: prediction, dtype: float64
    [38]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.16", y_pred_wfv)

    That's the right answer. Keep it up!

    Score: 1

    Task 3.5.17: Submit your walk-forward validation predictions to the grader to see test mean absolute error for your model.

    [40]:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.17", y_pred_wfv)
    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    Cell In [40], line 1
    ----> 1 wqet_grader.grade("Project 3 Assessment", "Task 3.5.17", y_pred_wfv)
    
    File /opt/conda/lib/python3.9/site-packages/wqet_grader/__init__.py:182, in grade(assessment_id, question_id, submission)
        177 def grade(assessment_id, question_id, submission):
        178   submission_object = {
        179     'type': 'simple',
        180     'argument': [submission]
        181   }
    --> 182   return show_score(grade_submission(assessment_id, question_id, submission_object))
    
    File /opt/conda/lib/python3.9/site-packages/wqet_grader/transport.py:146, in grade_submission(assessment_id, question_id, submission_object)
        144     raise Exception('Grader raised error: {}'.format(error['message']))
        145   else:
    --> 146     raise Exception('Could not grade submission: {}'.format(error['message']))
        147 result = envelope['data']['result']
        149 # Used only in testing
    
    Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!

    Communicate Results¶

    Task 3.5.18: Put the values for y_test and y_pred_wfv into the DataFrame df_pred_test (don't forget the index). Then plot df_pred_test using plotly express. Be sure to label the x-axis as "Date" and the y-axis as "PM2.5 Level". Use the title "Dar es Salaam, WFV Predictions".

    [42]:
    fig.write_image("images/3-5-18.png", scale=1, height=500, width=700)
    [43]:
        wqet_grader.grade("Project 3 Assessment", "Task 3.5.18", file)
    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    Cell In [43], line 2
          1 with open("images/3-5-18.png", "rb") as file:
    ----> 2     wqet_grader.grade("Project 3 Assessment", "Task 3.5.18", file)
    
    File /opt/conda/lib/python3.9/site-packages/wqet_grader/__init__.py:182, in grade(assessment_id, question_id, submission)
        177 def grade(assessment_id, question_id, submission):
        178   submission_object = {
        179     'type': 'simple',
        180     'argument': [submission]
        181   }
    --> 182   return show_score(grade_submission(assessment_id, question_id, submission_object))
    
    File /opt/conda/lib/python3.9/site-packages/wqet_grader/transport.py:146, in grade_submission(assessment_id, question_id, submission_object)
        144     raise Exception('Grader raised error: {}'.format(error['message']))
        145   else:
    --> 146     raise Exception('Could not grade submission: {}'.format(error['message']))
        147 result = envelope['data']['result']
        149 # Used only in testing
    
    Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!


    Time Series: Core Concepts

    [ ]:
    from IPython.display import YouTubeVideo

    Model Types¶

    Autoregression Models¶

Autoregression (AR) is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. AR works in a similar way to autocorrelation: in both cases, we're comparing data from one part of a series to another part. An AR model regresses a series on its own past values.

    ARMA Models¶

ARMA stands for Auto Regressive Moving Average, and it's a special kind of time-series analysis. So far, we've used autoregression (AR) to build our time-series models, and you might recall that AR models rely on values that remain relatively stable over time. That is, they can predict the future very well, as long as the future looks roughly the same as the past. The trouble with predicting the future is that things can suddenly change, and as a result, the future doesn't look much like the past anymore. These sudden changes, which economists call exogenous shocks, can be as big as a hurricane destroying a city or an unexpected increase in the minimum wage, and they can be as small as a new restaurant opening in a neighborhood or a single person losing their job. In our data, the air quality might be changed if there was a nearby forest fire, or if a building collapsed near one of the sensors and raised a giant cloud of dust.

    Below is a video from Ritvik Kharkar that explains MA models in terms of cupcakes and crazy professors — two things we love!

    [ ]:
    YouTubeVideo("voryLhxiPzE")

    And in this video, Ritvik talks about the ARMA model we use in Project 3.

    [ ]:
    YouTubeVideo("HhvTlaN06AM")

    Plots¶

    ACF Plot¶

When we've worked with autocorrelations in the past, we've treated them like single, static numbers, but there's really one autocorrelation for every possible lag. Sometimes we want to see how the autocorrelation changes as the lag grows, which means we need to think of it as a function of the lag. When we create a visual representation of an autocorrelation function (ACF), we're making an ACF plot.
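To make the idea concrete, here's a small sketch that computes a sample autocorrelation at a given lag by hand; the function name and data are invented:

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.sum(x[lag:] * x[:-lag]) / np.sum(x * x)

# A perfectly alternating series: strongly negative at lag 1,
# strongly positive at lag 2.
x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
```

Evaluating autocorr over many lags and plotting the results is exactly what an ACF plot shows.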

    PACF Plot¶

Autocorrelations mix two kinds of effects. The direct effect is the relationship between an observation and the one exactly k time steps back. Indirect effects are relationships passed along through the observations in between: if 1:00 influences 2:00 and 2:00 influences 3:00, then part of the correlation between 1:00 and 3:00 is just 2:00's influence carried forward. Those indirect effects can make lags look more important than they really are, so it's useful to strip them out and see only the direct relationships.

An autocorrelation that removes those indirect effects, keeping only the direct relationship at each lag, is called a partial autocorrelation, and when we view it as a function of the lag, we call it a PACF.

PACF plots represent those relationships visually. We compare our ACF and PACF plots to see which model best describes our time series: if the ACF tails off gradually while the PACF cuts off sharply after lag p, an AR(p) model is a good description; if the PACF tails off gradually while the ACF cuts off sharply after lag q, an MA(q) model is the better fit.

    Statistical Concepts¶

    Walk-Forward Validation¶

Our predictions lose power over time because the model gets farther and farther away from its beginning. But what if we could move that beginning forward with the model? That's what walk-forward validation is. In a walk-forward validation, we re-train the model for each new observation in the dataset, dropping the data that's farthest in the past. Let's say that our prediction for what's going to happen at 12:00 is based on what happened at 11:00, 10:00, and 9:00. When we move forward an hour to predict what's going to happen at 1:00, we only use data from 10:00, 11:00, and 12:00, dropping the data from 9:00 because it's now too far in the past.
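Here's a minimal sketch of the walk-forward loop, with a naive last-value forecast standing in for a real model and a toy series:

```python
import pandas as pd

# Toy series: the first three points are the training history,
# the rest play the role of the test set.
series = pd.Series([3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
history = series.iloc[:3].copy()
y_test_toy = series.iloc[3:]

predictions = []
for timestamp, actual in y_test_toy.items():
    # "Re-train" on the current history; a naive last-value forecast
    # stands in for re-fitting a real AR model here.
    predictions.append(history.iloc[-1])
    # Walk forward: fold the newly observed value into the history.
    history.loc[timestamp] = actual

y_pred_wfv = pd.Series(predictions, index=y_test_toy.index, name="prediction")
```

Each prediction uses only information available before its timestamp, which is what makes the validation honest.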

    Parameters¶

    Parameters are the parts of the model that are learned from the training data.
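For example, in a simple linear model the slope and intercept are parameters: the training data determines them, not the data scientist. A quick numpy sketch:

```python
import numpy as np

# Training data that follows y = 2x + 1 exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Least squares learns the slope and intercept -- the model's
# parameters -- from the data.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
```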

    Hyperparameters¶

    We've already seen that parameters are elements that a machine learning model learns from the training data. Hyperparameters, on the other hand, are settings that come from outside the model: data scientists choose them, either by examining the data themselves or by automating a search over different options to see how each performs. Because hyperparameters are set before the model is trained, they shape how training happens and how the model subsequently performs. One way of thinking about the difference is that parameters come from inside the model, and hyperparameters come from outside the model.

    When we talk about hyperparameters for time series models, we usually mean p and q (not to be confused with statistical p-values). p is the number of lagged observations included in the model (the autoregressive order), and q is the size of the moving average window (the moving average order). These values count as hyperparameters because we get to decide what they are before training: How many lagged observations do we want to include? How big should our window of moving averages be?
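One hedged sketch of automating that choice for p: fit AR models of several orders by least squares (a simplified stand-in for the statsmodels estimators used in this lesson), score each on held-out data, and keep the order with the lowest error. The series here is simulated from a known AR(2) process so the search has a sensible answer to find:

```python
import numpy as np

# Simulated AR(2) process, so a good p should be around 2.
rng = np.random.default_rng(1)
x = [0.0, 0.0]
for _ in range(998):
    x.append(0.2 * x[-1] + 0.6 * x[-2] + rng.normal())
x = np.asarray(x)
train, test = x[:700], x[700:]

def ar_mse(train, test, p):
    """Fit AR(p) on train by least squares; one-step-ahead MSE on test."""
    X = np.column_stack(
        [train[p - j - 1 : len(train) - j - 1] for j in range(p)]
    )
    coefs, *_ = np.linalg.lstsq(X, train[p:], rcond=None)
    full = np.concatenate([train, test])
    preds = [
        full[t - p : t][::-1] @ coefs  # most recent lag first
        for t in range(len(train), len(full))
    ]
    return float(np.mean((np.asarray(preds) - test) ** 2))

# Grid search over the hyperparameter p.
scores = {p: ar_mse(train, test, p) for p in range(1, 5)}
best_p = min(scores, key=scores.get)
```

In the lesson itself you would do this with statsmodels model classes and whatever error metric you're tracking; only the loop-over-candidate-orders idea carries over.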

    Rolling Averages¶

    A rolling average is the mean of a fixed-size window of values that slides along a dataset. For example, I might track the daily income for a shop I own. On Friday, I can calculate the average income from Monday through Thursday. Each business day the window slides forward one step: on Monday I average Tuesday through Friday, on Tuesday I average Wednesday through Monday, and so on. Because the window rolls forward one observation at a time, the averages smooth out day-to-day noise and give me a sense for how the data is changing over time. Calculating rolling averages is helpful both for spotting trends and for making forecasts about the ways data will change in the future.
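The shop example, sketched with pandas (the income figures are made up):

```python
import pandas as pd

# Hypothetical daily income over two business weeks.
income = pd.Series(
    [120, 135, 110, 150, 160, 125, 140, 155, 130, 145],
    index=pd.date_range("2022-01-03", periods=10, freq="B"),
)

# A 4-day rolling average: each value is the mean of that day and the
# three days before it. The first three entries are NaN because a full
# window isn't available yet.
rolling = income.rolling(window=4).mean()
```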


    Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

